ASYMPTOTIC INFERENCE FOR SEMIPARAMETRIC
ASSOCIATION MODELS 

By Gerhard Osius 
Universität Bremen

Association models for a pair of random elements X and Y (e.g.,
vectors) are considered which specify the odds ratio function up to an
unknown parameter $\theta$. These models are shown to be semiparametric
in the sense that they do not restrict the marginal distributions of
X and Y. Inference for the odds ratio parameter $\theta$ may be obtained
from sampling either Y conditionally on X or vice versa. Generalizing
results from Prentice and Pyke, Weinberg and Wacholder and Scott
and Wild, we show that asymptotic inference for $\theta$ under sampling
conditional on Y is the same as if sampling had been conditional
on X. Common regression models, for example, generalized linear
models with canonical link or multivariate linear, respectively, logistic
models, are association models where the regression parameter $\beta$ is
closely related to the odds ratio parameter $\theta$. Hence inference for
$\beta$ may be drawn from samples conditional on Y using an association
model.

1. Introduction and outline. A common approach to describe the
relationship between a random output variable Y of interest (e.g., a health
status) and a random input vector X (e.g., consumption of tobacco,
alcohol and other risk factors) is by means of a parametric regression model
which specifies the conditional distribution of Y given X = x up to an
unknown parameter vector. In the simplest case Y is an indicator (e.g.,
for the presence of a disease) and the conditional distribution is binomial
$B(1, p(x))$. The popular logistic regression model relates the logistic
transform of p(x) and a vector $z = h(x) \in \mathbb{R}^S$ of covariates
(obtained from x by a suitable function h) through
$\operatorname{logit} p(x) = \gamma + z^T\theta$ with parameters
$\gamma \in \mathbb{R}$ and $\theta \in \mathbb{R}^S$. The appropriate
sampling scheme for this model is to sample Y conditionally on X = x for
specified values of x. In epidemiology this is called a cohort study, each
of the J cohorts being determined by its value x. In contrast, the
so-called case-control studies are obtained by sampling X conditional on
Y = 1 (cases), respectively, Y = 0 (controls). An important result by
Prentice and Pyke [12] briefly states that asymptotic inference for the
parameter $\theta$ (but not for $\gamma$) in a case-control study may
be obtained as if the data came from a cohort study. Actually their work
covers the multivariate logistic regression model (cf. Example 2) for a
random variable Y taking values in {0, 1, . . . , K} and was generalized by
Scott and Wild [14] to multiplicative intercept models. Our aim is to extend
these results to semiparametric odds ratio models (introduced in [9]) for random
elements, including in particular random vectors Y and X, each with
continuous and/or discrete components. The odds ratio function OR(x, y) for
the joint density p(x, y) of X and Y is defined as a cross-product ratio with
respect to fixed reference values $x^\circ$ and $y^\circ$:

    $\mathrm{OR}(x,y) = \dfrac{p(x,y)\,p(x^\circ,y^\circ)}{p(x,y^\circ)\,p(x^\circ,y)}.$

An equivalent description is given by the corresponding ratio for the
conditional density p(y | X = x) of Y given X, or vice versa. Under mild
assumptions the joint distribution of (X, Y) is uniquely determined by the
odds ratio function and the marginal distributions of X and Y; compare
[9] or [10]. And conversely, for any pair of marginal distributions for X and
Y and an odds ratio function there exists a joint distribution having these
properties. The odds ratio function thus captures the complete association
structure of X and Y by ignoring the information contained in the marginal
distributions. A parametric odds ratio model specifies only the odds ratio
function up to an unknown parameter vector $\theta$, that is,

    $\log \mathrm{OR}(x,y) = \psi_\theta(x,y).$

This model is semiparametric in the sense that it does not restrict the
marginal distributions of X and Y, but only the association structure. An
important class are log-bilinear association models where the log-odds ratio
function is bilinear with respect to given transformations $z = h_X(x)$ and
$v = h_Y(y)$, that is,

    $\log \mathrm{OR}(x,y) = z^T \theta v.$

In fact, some widely used regression models, for example, generalized linear
models with canonical link function and multivariate linear, respectively,
logistic regression models, have a log-bilinear association structure. The
assumptions concerning the conditional distribution of Y given X in these
regression models may be removed by passing to the corresponding log-bilinear
odds ratio model. One advantage of odds ratio models over regression
models is that inference about the odds ratio parameter $\theta$ may be
obtained from sampling X conditionally on Y or vice versa. To prove this,
we first observe that maximum likelihood estimation is invariant under both
conditional sampling schemes, that is, the estimate $\hat\theta$ maximizing
the conditional likelihood $L_{X|Y}$ for samples of X given Y also maximizes
the corresponding conditional likelihood $L_{Y|X}$ for samples of Y given X,
and conversely. Generalizing the result in Prentice and Pyke [12] and Scott
and Wild [14], we show that the estimated asymptotic covariance matrix for
$\hat\theta$ is invariant under both conditional sampling schemes, too. Hence
asymptotic inference concerning the odds ratio parameter $\theta$ may be
obtained from a sample drawn conditionally on Y as if the sample had been
drawn conditionally on X.

The paper is organized as follows. In Section 2 we establish that the joint
distribution of (X, Y) is uniquely determined by its odds ratio function and
the marginal distributions (uniqueness theorem), and that each of these
three components can vary independently of another (existence theorem).
The latter result will be proved here under weaker assumptions than in [9]
using a different approach. Association models are introduced in Section 3
and some widely used regression models are recognized as having a
log-bilinear association. Although log-bilinear association is a natural and
common choice, we derive the main results for more general odds ratio
models determined by

(1.2)  $\log \mathrm{OR}(x,y) = G(z, v, \theta),$

where G is a given (sufficiently smooth) function. Section 4 establishes that
the maximum likelihood estimate $\hat\theta$ is invariant under the usual
sampling schemes: unconditional or conditional on X, respectively, Y. For
log-bilinear association models the likelihood to maximize corresponds to a
log-linear model for a suitable contingency table. Hence results on the
existence and uniqueness as well as techniques to compute the estimate are
already available.

Knowing that the estimate is invariant under conditional sampling given
either X or Y, we establish in several steps our main result, that its
estimated asymptotic normal distribution is invariant, too. In Section 5 we
consider sampling X conditional on Y but maximize the "reverse"
conditional log-likelihood $\ell(\lambda)$, arising from conditioning Y on X,
with respect to $\lambda = (\theta, \gamma^*)$, where $\gamma^*$ is a nuisance
parameter vector. For the information matrix
$I(\lambda) = E(-D_\lambda^2 \ell(\lambda))$ we show that the submatrix
$[I^{-1}(\lambda)]_{\theta\theta}$ of $I^{-1}(\lambda)$ corresponding to
$\theta$ is indeed the asymptotic covariance matrix of $\hat\theta$. To
establish the asymptotic normality of the estimate $\hat\lambda$, we first
prove its consistency in Section 6. Our asymptotic approach applies to a
fixed set $\{y_0, \ldots, y_K\}$ of values for Y to be conditioned upon and
independent samples of size $n_k$ from each conditional distribution of X
given $Y = y_k$, such that $n = \sum_k n_k$ tends to infinity while the
ratios $n_k/n$ remain fixed. In Section 7 the asymptotic normality is derived
more generally for any (weakly) consistent estimate $\hat\lambda$ which solves
the estimating equation at least approximately, that is,
$D_\lambda \ell(\hat\lambda) = o_P(\sqrt{n})$. Using the observed information
$J(\lambda) = -D_\lambda^2 \ell(\lambda)$ as a consistent estimate of
$I(\lambda)$, we finally obtain the asymptotic normality of the odds ratio
estimate $\hat\theta$.

The estimated asymptotic covariance matrix here is exactly the same as if
sampling had been conditional on X for the observed x-values.

We do not attempt to derive our results under the weakest possible as- 
sumptions but prefer a few easily interpretable conditions which will be 
verified for a log-bilinear association model under mild distributional as- 
sumptions. The approach adopted here is symmetric in X and Y so that 
interchanging X with Y in any argument entails its dual. 



2. The odds ratio function. Consider arbitrary nonempty spaces $\Omega_X$,
respectively $\Omega_Y$, with $\sigma$-algebras $\mathcal{B}_X$, respectively
$\mathcal{B}_Y$, and denote the product $\sigma$-algebra on
$\Omega = \Omega_X \times \Omega_Y$ by $\mathcal{B}$. Let $\mathcal{P}$ be the
space of all probability measures P on $(\Omega, \mathcal{B})$ and denote the
marginal distributions of P on $\Omega_X$, respectively $\Omega_Y$, by $P^X$,
respectively $P^Y$. The definition of an odds ratio function for P
requires a positive density with respect to a product measure and a natural
choice is the product $P^{XY} = P^X \times P^Y$ of the marginals. This leads
to the subspace of probability measures P which have a positive density with
respect to $P^{XY}$, or equivalently, are dominated by and dominate $P^{XY}$:

    $\mathcal{P}_{\ll} = \{P \in \mathcal{P} \mid dP/dP^{XY} > 0\} = \{P \in \mathcal{P} \mid P \ll P^{XY} \ll P\}.$

For any $P \in \mathcal{P}_{\ll}$ with density $p = dP/dP^{XY}$ its odds ratio
function $\mathrm{OR}_P$ with respect to fixed reference values
$x^\circ \in \Omega_X$ and $y^\circ \in \Omega_Y$ is defined on
$\Omega_X \times \Omega_Y$ by

(2.1)  $\mathrm{OR}_P(x,y) = \dfrac{p(x,y)\,p(x^\circ,y^\circ)}{p(x,y^\circ)\,p(x^\circ,y)}.$

The choice of the dominating product measure $P^{XY}$ is not essential
(cf. [9]): replacing p by a positive density with respect to a product
$\nu = \nu_X \times \nu_Y$ of $\sigma$-finite measures yields the same ratio
(2.1). Since the density p of P is only unique up to almost sure equality,
the same holds for the odds ratio function $\mathrm{OR}_P$ of P, which
nevertheless will also be denoted simply by OR(P). The log-odds ratio
function may be written in terms of the log-density

(2.2)  $\log \mathrm{OR}_P(x,y) = \log p(x,y) + \log p(x^\circ,y^\circ) - \log p(x,y^\circ) - \log p(x^\circ,y).$
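
For finite spaces, definition (2.2) can be evaluated directly on a probability table. The following is a minimal sketch, not part of the paper: the function name `log_odds_ratio` and the 3 x 2 table are hypothetical. It illustrates that rescaling rows and columns changes both marginals but leaves the log-odds ratio function untouched.

```python
import numpy as np

def log_odds_ratio(p, j0=0, k0=0):
    """Log-odds ratio function (2.2) of a strictly positive table p[j, k]
    with respect to the reference cell (j0, k0)."""
    logp = np.log(np.asarray(p, dtype=float))
    return logp + logp[j0, k0] - logp[:, [k0]] - logp[[j0], :]

# Hypothetical 3 x 2 joint distribution (illustrative numbers only).
p = np.array([[0.10, 0.20],
              [0.15, 0.05],
              [0.30, 0.20]])

# Rescale rows and columns, then renormalise: the marginals change.
a = np.array([2.0, 0.5, 1.0])[:, None]
b = np.array([1.0, 3.0])[None, :]
q = a * p * b
q /= q.sum()

psi_p = log_odds_ratio(p)
psi_q = log_odds_ratio(q)   # equals psi_p up to rounding
```

Both tables share the same association structure in the sense of this section, although their marginal distributions differ.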
It is convenient to view any $P \in \mathcal{P}$ as a joint distribution of a
pair (X, Y) of random elements defined on some probability space with values
in $\Omega$, and the odds ratio function of (X, Y) is defined by
OR(X, Y) = OR(P).

To show that the odds ratio function completely characterizes the
association between X and Y, we have to restrict the joint distribution P by
requiring that its log-density $\log p$ is $P^{XY}$-integrable, or
equivalently, that the Kullback-Leibler information [7]

    $I(P^{XY} \mid P) = \int \log\!\left(\dfrac{dP^{XY}}{dP}\right) dP^{XY} = -\int \log p \; dP^{XY}$

is finite. Any P in the subclass
$\mathcal{P}_I = \{P \in \mathcal{P}_{\ll} \mid I(P^{XY} \mid P) < \infty\}$ is
uniquely determined by its marginal distributions and its odds ratio function.

Theorem 1 (Uniqueness). Any $P_1, P_2 \in \mathcal{P}_I$ having the same
marginals $P_1^X = P_2^X$, $P_1^Y = P_2^Y$ and the same odds ratio function
$\mathrm{OR}(P_1) = \mathrm{OR}(P_2)$ agree: $P_1 = P_2$.

For a proof one easily establishes $I(P_1 \mid P_2) = 0$ using (2.2);
compare [10].

Next we want to "define" a distribution P on $\Omega$ by specifying its
marginal distributions and its (log) odds ratio function. For given
distributions $\pi_X$ on $\Omega_X$ and $\pi_Y$ on $\Omega_Y$ and a measurable
function $\psi$ on $\Omega$, we investigate under which conditions we can find
a $P \in \mathcal{P}_I$ with $P^X = \pi_X$, $P^Y = \pi_Y$ and
$\log \mathrm{OR}(P) = \psi$. First of all, $\psi$ has to satisfy the obvious
constraints

Condition (OR1). $\psi(x, y^\circ) = 0$, $\psi(x^\circ, y) = 0$ for all x, y.

Furthermore from $P \in \mathcal{P}_I$ and (2.2) we obtain two necessary
integrability conditions:

Condition (E1). $\psi$ is $\pi_X \times \pi_Y$-integrable.

Condition (E2). There exist a $\pi_X$-integrable function
$\beta : \Omega_X \to \mathbb{R}$ and a $\pi_Y$-integrable function
$\gamma : \Omega_Y \to \mathbb{R}$ such that $\exp(\psi - \beta - \gamma)$ is
$\pi_X \times \pi_Y$-integrable.

These conditions are also sufficient for the existence of the wanted
$P \in \mathcal{P}_I$.

Theorem 2 (Existence). For distributions $\pi_X$ on $\Omega_X$ and $\pi_Y$
on $\Omega_Y$ and a measurable function $\psi$ on $\Omega_X \times \Omega_Y$
the following statements are equivalent:

(a) There exists $P \in \mathcal{P}_I$ with $P^X = \pi_X$, $P^Y = \pi_Y$ and
$\log \mathrm{OR}(P) = \psi$.

(b) There exists $P \in \mathcal{P}_I$ with $\log \mathrm{OR}(P) = \psi$.

(c) $\psi$ satisfies Conditions (OR1), (E1) and (E2).

The proof is given in Appendix A.1. A few remarks are in order.

1. Conditions (E1) and (E2) hold for bounded $\psi$, for example, for
continuous $\psi$ and compact $\Omega$.

2. The integrability of $\exp(\psi - \beta - \gamma)$ in Condition (E2) holds
if $\psi \le \beta + \gamma$. And if even $|\psi| \le \beta + \gamma$, then
Condition (E1) follows, too.

3. For finite $\Omega_Y$ (or $\Omega_X$) Condition (E1) implies Condition (E2)
for $\beta(x) = \sum_y |\psi(x,y)|$ and $\gamma = 0$.

4. Although P is uniquely determined by Theorem 1, there is no explicit
formula for P available. In the proof P is given by an I-projection, which
can only be obtained as a limit in an iterative procedure. Only for binary
Y (and vector-valued X) the distribution P is easily available; compare
[1] or [9].

5. A stronger version of Condition (E2) requiring $\exp(\psi - \beta)$ and
$\exp(\psi - \gamma)$ to be integrable was used in [9, 10] to obtain P as a
limit of an iterative proportional fitting procedure.

6. For finite spaces $\Omega_X$ and $\Omega_Y$ this result has long been
known; compare [11], Section 3.4.
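
For finite spaces, the joint distribution whose existence Theorem 2 asserts can be approximated by iterative proportional fitting, as mentioned in Remark 5. The following is a minimal sketch under that finiteness assumption; the function name `fit_joint`, the fixed iteration count and all numbers are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

def fit_joint(pi_x, pi_y, psi, n_iter=500):
    """Iterative proportional fitting on a finite table: returns p with
    (approximately) marginals pi_x, pi_y and log-odds ratio function psi."""
    p = np.exp(psi) * np.outer(pi_x, pi_y)   # start with the right odds ratios
    p /= p.sum()
    for _ in range(n_iter):
        p *= (pi_x / p.sum(axis=1))[:, None]  # match the X-marginal
        p *= (pi_y / p.sum(axis=0))[None, :]  # match the Y-marginal
    return p

pi_x = np.array([0.2, 0.3, 0.5])
pi_y = np.array([0.6, 0.4])
# Log-bilinear psi with reference cell (0, 0), so Condition (OR1) holds.
psi = 0.8 * np.outer([0.0, 1.0, 2.0], [0.0, 1.0])

p = fit_joint(pi_x, pi_y, psi)
logp = np.log(p)
fitted_psi = logp + logp[0, 0] - logp[:, [0]] - logp[[0], :]
```

Row and column rescaling never changes cross-product ratios, so the odds ratio function is preserved at every step while the marginals converge to the prescribed ones.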

3. Association models. An association model for the joint distribution P
of (X, Y) only restricts the odds ratio function of P and leaves the marginal
distributions of X and Y arbitrary. To formulate such a model we assume
that P has a positive density with respect to a fixed product measure
$\nu = \nu_X \times \nu_Y$ of $\sigma$-finite measures $\nu_X$, respectively
$\nu_Y$, on $\Omega_X$, respectively $\Omega_Y$. Hence P is restricted to the
class $\mathcal{P}^{XY} = \{P \in \mathcal{P} \mid P \ll \nu \ll P\} \subset \mathcal{P}_{\ll}$,
which also restricts the marginal distribution $P^X$ of X to

    $\mathcal{P}^X = \{\pi_X \text{ probability measure on } \Omega_X \mid \pi_X \ll \nu_X \ll \pi_X\},$

and the marginal $P^Y$ to the corresponding $\mathcal{P}^Y$. From now on all
densities on $\Omega$, respectively $\Omega_X$, $\Omega_Y$, are taken with
respect to the dominating measure $\nu$, respectively $\nu_X$, $\nu_Y$.

We consider parametric association models indexed by a parameter vector
$\theta \in \mathbb{R}^S$. For any $\theta$ let $\psi_\theta$ be a measurable
function on $\Omega$ satisfying Condition (OR1). The parametric odds ratio
model restricts the log-odds ratio function of P to
$\log \mathrm{OR}(P) = \psi_\theta$ for some $\theta$. To guarantee for any
$\theta$ and any marginals $\pi_X$, $\pi_Y$ the existence of a joint
distribution P with $\psi_\theta = \log \mathrm{OR}(P)$ and these marginals,
we assume the following bounding condition:

Condition (OR2). There exist nonnegative measurable functions $\psi_X$
on $\Omega_X$ and $\psi_Y$ on $\Omega_Y$ with
$|\psi_\theta(x,y)| \le [\psi_X(x) + \psi_Y(y)] \cdot \|\theta\|$ for all
$\theta$, x, y.

Furthermore we restrict $\pi_X$ to the class
$\mathcal{P}_I^X = \{\pi_X \in \mathcal{P}^X \mid \psi_X \text{ is } \pi_X\text{-integrable}\}$
and $\pi_Y$ to the corresponding class $\mathcal{P}_I^Y$. Condition (c) in
Theorem 2 holds for any $\pi_X \in \mathcal{P}_I^X$, $\pi_Y \in \mathcal{P}_I^Y$
and $\theta$, and hence there exists a unique $P \in \mathcal{P}_I$ with
$P^X = \pi_X$, $P^Y = \pi_Y$ and $\log \mathrm{OR}(P) = \psi_\theta$. Thus a
parametric association model (PAM) for distributions P in
$\mathcal{P}_I^{XY} = \mathcal{P}^{XY} \cap \mathcal{P}_I$ is specified
by the requirements

(3.1)  $\log \mathrm{OR}(P) \in \{\psi_\theta \mid \theta \in \mathbb{R}^S\}, \quad P^X \in \mathcal{P}_I^X, \quad P^Y \in \mathcal{P}_I^Y.$

This is a semiparametric model for the joint distribution P since the
marginals are only slightly restricted by integrability conditions. By (2.2)
a density p(x, y) of $P \in \mathcal{P}_I^{XY}$ satisfying (3.1) can be
parametrized as

(3.2)  $\log p(x,y) = \alpha + \beta(x) + \gamma(y) + \psi_\theta(x,y)$

with $\alpha \in \mathbb{R}$ and integrable functions $\beta$ and $\gamma$.
Identifiability may be achieved through the constraints
$\beta(x^\circ) = 0$ and $\gamma(y^\circ) = 0$, which will be assumed here.
The integration constant $\alpha$ is determined by

    $\alpha = -\log \int \exp(\beta + \gamma + \psi_\theta)\, d\nu$

and the marginal density $p^X(x)$ of $P^X$ is given by

    $\log p^X(x) = \alpha + \beta(x) + \delta(x), \qquad \delta(x) = \log \int \exp(\gamma(y) + \psi_\theta(x,y))\, d\nu_Y(y).$

The conditional distribution of Y given X = x belongs to $\mathcal{P}^Y$ and
the conditional density p(y | X = x) satisfies

(3.3)  $\log p(y \mid X = x) = \gamma(y) + \psi_\theta(x,y) - \delta(x).$

The integration constant $\delta(x)$ can be removed by passing to the density
ratio

(3.4)  $\log \dfrac{p(y \mid X = x)}{p(y^\circ \mid X = x)} = \gamma(y) + \psi_\theta(x,y).$

Equation (3.4) may be viewed as a "regression model." Conversely, suppose
a model for P is specified by (3.4) with an arbitrary integrable function
$\gamma$ and the parametric family $\psi_\theta$. Then
$\log \mathrm{OR}(P) = \psi_\theta$ and hence the model (3.4) is
semiparametric in the sense that it does not restrict the marginal
distributions $P^X$ and $P^Y$, provided they belong to the class
$\mathcal{P}_I^X$, respectively $\mathcal{P}_I^Y$. In the latter case the
regression model (3.4) is in fact equivalent to the association model (3.1).
Note that for finite $\Omega_Y$ and counting measure $\nu_Y$ the
integrability condition imposed by $P^Y \in \mathcal{P}_I^Y$ always holds.
An important class of parametric association models are log-bilinear
association (LBA) models with respect to measurable maps
$h_X : \Omega_X \to \mathbb{R}^{K_X}$ and $h_Y : \Omega_Y \to \mathbb{R}^{K_Y}$,
which will always be chosen here such that $h_X(x^\circ) = 0$ and
$h_Y(y^\circ) = 0$. The parameter $\theta$ is a $K_X \times K_Y$-matrix and the
log-odds ratio function is bilinear in the transformed variables $h_X(x)$ and
$h_Y(y)$:

(3.5)  $\psi_\theta(x,y) = h_X(x)^T \theta\, h_Y(y)$  for all x, y.

Since $|h_X(x)^T \theta\, h_Y(y)| \le \|h_X(x)\| \cdot \|h_Y(y)\| \cdot \|\theta\|$,
Condition (OR2) holds for $\psi_X(x) = \|h_X(x)\|^2$ and
$\psi_Y(y) = \|h_Y(y)\|^2$. And the integrability condition in
$\mathcal{P}_I^X$ and $\mathcal{P}_I^Y$ states that the second moments
$E(\|h_X(X)\|^2)$ and $E(\|h_Y(Y)\|^2)$ are finite. Any submodel of (3.5)
specified by a linear restriction of the form $\theta = A^T \theta^* B$ with
given matrices A, B and parameter matrix $\theta^*$ yields a log-bilinear
association too, with respect to $h_X^* = A h_X$, $h_Y^* = B h_Y$.

Association models have been introduced long ago in the context of con- 
tingency tables, that is, when both X and Y have a finite range; see [4] for 
a review. The "RC association models" and "RC correlation models" in [4] 
are both association models in our sense, the former (but not the latter) 
being log-bilinear. Extensions of these models to multivariate contingency 
tables studied in Gilula and Haberman [3] also satisfy (3.1). Goodman [4] 
has generalized the bivariate normal distribution to a bivariate log-bilinear 
model in our sense, but did not establish its semiparametric nature. Re- 
turning to our primary focus, namely general random vectors X and Y , the 
following examples reveal that the association structure of some widely used 
regression models is in fact log-bilinear. 

Example 1 (Generalized linear models). Let Y be a univariate random
variable, X an R-dimensional random vector and suppose that the
conditional density of Y given X = x belongs to the exponential family
$p(y \mid X = x) = \exp\{a(\phi)^{-1}[y \cdot \tau(x) - b(\tau(x))] + c(y,\phi)\}$
with suitable functions $a, b, c, \tau$ and a (dispersion) parameter $\phi$;
compare [8]. Then the log-odds ratio function has the form
$\psi(x,y) = a(\phi)^{-1} \cdot [\tau(x) - \tau(x^\circ)] \cdot [y - y^\circ]$
and $\tau(x)$ is a strictly monotone function of the conditional expectation
$\mu(x) = E(Y \mid X = x)$, namely $\tau(x) = \Lambda(\mu(x))$, where
$\Lambda^{-1} = b'$. A generalized linear model specifies the conditional
expectation via a link function g:

(3.6)  $g(\mu(x)) = \alpha + z^T \beta,$

where $z = h_X(x) \in \mathbb{R}^S$ is a known vector of formal covariates
(obtained from x by a given function $h_X$) and $\alpha \in \mathbb{R}$,
$\beta \in \mathbb{R}^S$ are unknown parameters. For
$G = \Lambda \circ g^{-1}$ and $h_X(x^\circ) = 0$ the log-odds ratio function is
$\psi(x,y) = a(\phi)^{-1} \cdot [G(\alpha + z^T\beta) - G(\alpha)] \cdot [y - y^\circ]$.
If the canonical link $g = \Lambda$ is chosen (so that G is the identity), then

(3.7)  $\psi(x,y) = z^T \theta\, [y - y^\circ]$

is of the form (3.5) with $h_Y(y) = y - y^\circ$ and parameter
$\theta = a(\phi)^{-1}\beta$. Note that the intercept $\alpha$ is no longer
present in (3.7). Taking the log-bilinear association model (3.7) instead of
(3.6) weakens the distributional assumption while still including the
regression parameter $\beta$ up to a positive constant $a(\phi)^{-1}$. In
particular a linear hypothesis $C\beta = 0$ with a given matrix C is
equivalent to $C\theta = 0$, and for a vector c a one-sided hypothesis
$c^T\beta > 0$ is equivalent to $c^T\theta > 0$. Generalized linear models
with canonical link are often used. First of all, normal conditional
distributions $N(\mu(x), \sigma^2)$ of Y yield the classical linear model with
$a(\phi) = \sigma^2$. Second, binomial conditional distributions
$B(1, \mu(x))$ lead to logistic regression models. And finally, for Poisson
conditional distributions $\mathrm{Pois}(\mu(x))$ log-linear models are
obtained. Note that for the latter two models we have $a(\phi) = 1$ and hence
$\theta = \beta$.
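
As a numerical illustration of (3.7), the sketch below evaluates the log-odds ratio of a Poisson log-linear model (canonical link $g = \log$, $a(\phi) = 1$) directly from the conditional probability function and compares it with $z^T\beta\,[y - y^\circ]$. The parameter values and helper names are hypothetical and serve only as an example.

```python
import math
import numpy as np

def poisson_logpmf(y, mu):
    """log P(Y = y) for a Poisson distribution with mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

# Hypothetical regression parameters of the log-linear model (3.6).
alpha, beta = 0.3, np.array([0.5, -0.2])

def mu(z):
    """Conditional mean under the canonical (log) link."""
    return math.exp(alpha + float(z @ beta))

def log_or(z, y, z0, y0=0):
    """Log-odds ratio (2.2) computed from the conditional density of Y given X."""
    return (poisson_logpmf(y, mu(z)) + poisson_logpmf(y0, mu(z0))
            - poisson_logpmf(y, mu(z0)) - poisson_logpmf(y0, mu(z)))

z, z0 = np.array([1.0, 2.0]), np.zeros(2)   # z0 = h_X(x°) = 0
lhs = log_or(z, 3, z0)                      # from the densities
rhs = float(z @ beta) * (3 - 0)             # z^T theta (y - y°), theta = beta
```

The intercept `alpha` cancels in `lhs`, in line with the remark that it is no longer present in (3.7).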

The above semiparametric nature of the logistic regression model has been 
noticed before; compare Breslow, Robins and Wellner [1], who established its 
semiparametric efficiency under case-control sampling. However, the logistic 
regression model is the only one among generalized linear models for binary 
Y which is equivalent to an association model (3.1); compare [9] or Example 
2 below. And the resulting relation between the two conditional densities 
(given X, resp., Y) has been noticed before by Kagan [6]. 

Example 2 (Multivariate linear logistic regression). Extending univariate
logistic regression to the multivariate case, suppose Y (e.g., a disease
status) takes values in $\Omega_Y = \{0, 1, \ldots, K\}$, K > 1, and X is an
R-dimensional vector of observed covariates. Then
$\mathcal{L}(Y \mid X = x)$ is a multinomial distribution
$M_{K+1}(1, \pi(x))$ with K + 1 classes and probabilities
$\pi_k(x) = P(Y = k \mid X = x) > 0$. Using the multivariate logistic
transformation $\operatorname{logit} \pi_k(x) = \log(\pi_k(x)/\pi_0(x))$ of
$\pi(x)$, the linear logistic regression model is given by

(3.8)  $\operatorname{logit} \pi_k(x) = \gamma_k + z^T \theta_k, \qquad k = 1, \ldots, K,$

where $z = h_X(x)$ and $\gamma_k \in \mathbb{R}$, $\theta_k \in \mathbb{R}^S$
are unknown parameters. Choosing $y^\circ = 0$, the log-odds ratio function is

(3.9)  $\psi(x,k) = h_X(x)^T \theta_k = h_X(x)^T \theta\, h_Y(k),$

where $\theta = (\theta_1, \ldots, \theta_K)$ is an $S \times K$ parameter
matrix, and the function $h_Y : \Omega_Y \to \mathbb{R}^K$ maps k > 0 to the
kth unit vector and $h_Y(0) = 0$. Hence the linear logistic regression model
is equivalent to the log-bilinear association model (3.9), provided
$E(\|h_X(X)\|^2)$ is finite. As mentioned above, this also holds for submodels
given by linear constraints, for example, $\theta_k = \theta^*$ for all k > 0.
Although the model (3.8) has been known for a long time, its semiparametric
character (based on Theorem 2) does not seem to have been established before
for K > 2.

Replacing $z^T\theta_k$ by an arbitrary function $g(z, \theta_k)$ leads to a
general logistic regression model

    $\operatorname{logit} \pi_k(x) = \gamma_k + g(z, \theta_k), \qquad k = 1, \ldots, K,$

which is equivalent to the log-odds ratio model

    $\psi(x,k) = g(h_X(x), \theta_k) = g(h_X(x), \theta\, h_Y(k)).$

Example 3 (Multivariate linear regression). Let Y and X be random
vectors taking values in $\mathbb{R}^K$, respectively $\mathbb{R}^R$, and
suppose that the conditional distribution of Y given X is multivariate normal,

(3.10)  $\mathcal{L}(Y \mid X = x) = N_K(\mu_Y(x), \Sigma),$

such that the conditional covariance matrix $\Sigma$ is nonsingular and does
not depend on x. From the conditional log-density

    $\log p(y \mid X = x) = -\tfrac{1}{2}\big[\log[(2\pi)^K \det(\Sigma)] + [y - \mu_Y(x)]^T \Sigma^{-1} [y - \mu_Y(x)]\big]$

the log-odds ratio function with respect to $y^\circ = 0$ is
$\psi(x,y) = [\mu_Y(x) - \mu_Y(x^\circ)]^T \Sigma^{-1} y$. The multivariate
linear regression model

(3.11)  $\mu_Y(x) = \alpha + \beta^T z$

with covariates $z = h_X(x) \in \mathbb{R}^S$ and $S \times K$ parameter matrix
$\beta$ has a log-bilinear association

(3.12)  $\psi(x,y) = h_X(x)^T \theta\, y$

with parameter matrix $\theta = \beta\Sigma^{-1}$, assuming
$h_X(x^\circ) = 0$. Note that the regression parameter $\beta$ may only be
recovered from $\theta$ if the covariance matrix $\Sigma$ is known. However,
any linear hypothesis $C\beta = 0$ is equivalent to the corresponding
hypothesis $C\theta = 0$, and the latter may be tested using the
semiparametric association model (3.12) instead of the regression model
(3.11) with the distributional assumption (3.10). If instead of (3.10) we
allow the conditional covariance matrix to depend on x, that is,
$\mathcal{L}(Y \mid X = x) = N_K(\mu_Y(x), \Sigma(x))$, then (3.11) leads to
$\psi(x,y) = h_X(x)^T \beta\, \Sigma^{-1}(x)\, y$, which is not bilinear.
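
The relation $\theta = \beta\Sigma^{-1}$ in (3.12) can be checked numerically. The sketch below draws hypothetical parameters at random (dimensions, seed and helper names are illustrative assumptions), evaluates the log-odds ratio (2.2) from the normal conditional log-density of (3.10), and compares it with $z^T\theta y$.

```python
import numpy as np

def mvn_logpdf(y, mean, cov):
    """Log-density of N_K(mean, cov) at y, written out as in Example 3."""
    k = len(y)
    d = y - mean
    return -0.5 * (k * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                   + d @ np.linalg.solve(cov, d))

rng = np.random.default_rng(0)
S, K = 3, 2                                  # hypothetical dimensions
alpha = rng.normal(size=K)
beta = rng.normal(size=(S, K))               # S x K regression matrix
A = rng.normal(size=(K, K))
Sigma = A @ A.T + np.eye(K)                  # nonsingular covariance

def mean_y(z):
    return alpha + beta.T @ z                # model (3.11)

def log_or(z, y):
    z0, y0 = np.zeros(S), np.zeros(K)        # h_X(x°) = 0 and y° = 0
    return (mvn_logpdf(y, mean_y(z), Sigma) + mvn_logpdf(y0, mean_y(z0), Sigma)
            - mvn_logpdf(y, mean_y(z0), Sigma) - mvn_logpdf(y0, mean_y(z), Sigma))

theta = beta @ np.linalg.inv(Sigma)          # theta = beta Sigma^{-1}
z, y = rng.normal(size=S), rng.normal(size=K)
lhs, rhs = log_or(z, y), z @ theta @ y       # should agree up to rounding
```

Neither the intercept `alpha` nor the normalising constant of the density enters `lhs`, which is why only the product $\beta\Sigma^{-1}$ is identified by the association model.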

The above examples reveal that important regression models may be
generalized to log-bilinear association models by ignoring the distributional
assumption for the conditional distribution. Although log-bilinear
association is a natural candidate, we also consider the more general
association model

(3.13)  $\psi_\theta(x,y) = G(h_X(x), h_Y(y), \theta)$  for all x, y,

given by a fixed function G with
$G(0, \cdot, \cdot) = G(\cdot, 0, \cdot) = 0$. We assume throughout that the
function G satisfies the following regularity condition (although some
results also hold under weaker assumptions):

Condition (R1). $G(z, v, \theta)$ is thrice continuously differentiable with
respect to $\theta$ for all $z \in h_X[\Omega_X]$, $v \in h_Y[\Omega_Y]$ and
the derivatives are continuous in z and v.

Further properties of the functions $h_X$, $h_Y$ and G will be assumed later
in Conditions (R2") and (MC).

4. Estimation. For a given data set $(x_i, y_i)$ with i = 1, . . . , n we
want to estimate the association parameter $\theta$ of the model (3.13) under
unconditional sampling from the joint distribution of (X, Y) and conditional
sampling of Y given X or vice versa. Not surprisingly the maximum likelihood
estimate $\hat\theta$ under any of these three sampling schemes may be
obtained as a solution of the same estimating equation.

4.1. Unconditional sampling. For unconditional sampling the data set
$(x_i, y_i)$ is an independent sample from the joint distribution of (X, Y).
Suppose there are J + 1 > 1 different x-values and K + 1 > 1 different
y-values observed and denote the corresponding subsets of $\Omega_X$ and
$\Omega_Y$ by $\widetilde\Omega_X = \{x_{(0)}, \ldots, x_{(J)}\}$ and
$\widetilde\Omega_Y = \{y_{(0)}, \ldots, y_{(K)}\}$. If $r_{jk}$ is the
observed frequency of $(x_{(j)}, y_{(k)})$, then the likelihood is

    $L_{XY} = \prod_{j=0}^{J} \prod_{k=0}^{K} p(x_{(j)}, y_{(k)})^{r_{jk}} = L_{X|Y} \cdot L_Y$

with a conditional and a marginal likelihood

(4.1)  $L_{X|Y} = \prod_{k=0}^{K} \prod_{j=0}^{J} p(x_{(j)} \mid Y = y_{(k)})^{r_{jk}}, \qquad L_Y = \prod_{k=0}^{K} p^Y(y_{(k)})^{r_{+k}}$

(the subscript "+" indicates summation over the replaced index). The model
does not restrict the marginal distribution of Y and hence the empirical
density with respect to counting measure $\widetilde\nu_Y$ on
$\widetilde\Omega_Y$,

(4.2)  $\hat p^Y(y_{(k)}) = \tfrac{1}{n}\, r_{+k}$  for k = 0, . . . , K

is the usual nonparametric estimate. If we restrict the distribution $P^Y$ to
the class $\widetilde{\mathcal{P}}^Y$ of all distributions with finite support
$\widetilde\Omega_Y$, then $L_Y$ is a multinomial likelihood which attains its
maximum for (4.2). Hence, for estimation purposes we may restrict the marginal
$P^Y$ to $\widetilde{\mathcal{P}}^Y$ and maximization of $L_{XY}$ is equivalent
to separate maximization of $L_{X|Y}$ and $L_Y$, because the latter two have
no common parameters.
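
The factorization $L_{XY} = L_{X|Y} \cdot L_Y$ and the role of the empirical density (4.2) can be illustrated on a small table. The following is a minimal sketch with hypothetical frequencies $r_{jk}$ and probabilities $p_{jk}$; none of the numbers come from the paper.

```python
import numpy as np

r = np.array([[4, 1],
              [2, 3],
              [1, 5]])               # hypothetical frequencies r_jk
p = np.array([[0.20, 0.10],
              [0.15, 0.15],
              [0.10, 0.30]])         # hypothetical joint probabilities p_jk

p_y = p.sum(axis=0)                  # marginal density of Y
p_x_given_y = p / p_y                # p(x_(j) | Y = y_(k)); columns sum to 1

log_L_xy = np.sum(r * np.log(p))
log_L_x_given_y = np.sum(r * np.log(p_x_given_y))
log_L_y = np.sum(r.sum(axis=0) * np.log(p_y))

# Empirical density (4.2) and the marginal log-likelihood it attains.
p_y_hat = r.sum(axis=0) / r.sum()
log_L_y_hat = np.sum(r.sum(axis=0) * np.log(p_y_hat))
```

Here `log_L_xy` equals `log_L_x_given_y + log_L_y`, and `log_L_y_hat` is at least as large as `log_L_y` for any other choice of the marginal.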

Interchanging X and Y, we split the likelihood as
$L_{XY} = L_{Y|X} \cdot L_X$ and by the above argument we may additionally
restrict $P^X$ to the class $\widetilde{\mathcal{P}}^X$ of all distributions
with finite support $\widetilde\Omega_X$. Under these restrictions for both
$P^X$ and $P^Y$ the likelihood $L_{XY}$ is a multinomial likelihood for the
observed $(J + 1) \times (K + 1)$-contingency table $(r_{jk})$. Hence,
estimation of $\theta$ is reduced to a multinomial model whose probabilities
$p_{jk} = p(x_{(j)}, y_{(k)})$ satisfy the log-odds ratio model

    $\log(p_{jk}\, p_{00} / p_{j0}\, p_{0k}) = \psi_\theta(x_{(j)}, y_{(k)}) =: \psi_{jk}(\theta)$  for all j and k

with respect to the reference values $x^\circ = x_{(0)}$ and
$y^\circ = y_{(0)}$. The parametrization (3.2) now involves only a finite
number of parameters

(4.3)  $\log p_{jk} = \beta_j + \gamma_k + \psi_{jk}(\theta) - \log\Big(\sum_{j'} \sum_{k'} \exp[\beta_{j'} + \gamma_{k'} + \psi_{j'k'}(\theta)]\Big),$

namely $\beta_j = \beta(x_{(j)})$, $\gamma_k = \gamma(y_{(k)})$ and $\theta$
with $\beta_0 = \gamma_0 = 0$. Instead of maximizing $L_{XY}$, it is typically
preferable to maximize either $L_{Y|X}$ or $L_{X|Y}$ using the parametrization
of the conditional probabilities $p_{k|j} = p_{jk}/p_{j+}$ or
$p_{j|k} = p_{jk}/p_{+k}$ given by (3.3) and its dual

    $\log p_{k|j} = \gamma_k + \psi_{jk}(\theta) - \delta_j, \qquad \log p_{j|k} = \beta_j + \psi_{jk}(\theta) - \varepsilon_k,$

where the parameters $\delta_j$, respectively, $\varepsilon_k$ are determined
by the remaining ones.

4.2. Conditional sampling. When sampling is conditional on values for
Y taken from $\widetilde\Omega_Y = \{y_{(0)}, \ldots, y_{(K)}\}$, say, then
the data set $(x_i, y_i)$ with i = 1, . . . , n is partitioned into K + 1
independent subsamples given by the values of $y_i$, such that each subsample
$(x_i)$ with $y_i = y_{(k)}$ is an independent sample from the conditional
distribution $\mathcal{L}(X \mid Y = y_{(k)})$. Instead of maximizing the
appropriate likelihood $L_{X|Y}$ we can equivalently maximize the
unconditional likelihood $L_{XY}$ or even the "reverse" conditional likelihood
$L_{Y|X}$. The latter is preferable from a computational point of view when
its nuisance parameters $\gamma_k$ are fewer than those of $L_{X|Y}$, that is,
for K < J. A dual argument applies if sampling is conditional on values for X
taken from $\widetilde\Omega_X = \{x_{(0)}, \ldots, x_{(J)}\}$.

4.3. Log-bilinear association. In the log-bilinear association model (3.5),
the log-odds ratios may be written as
$\psi_{jk}(\theta) = z_j^T \theta v_k$ with $z_j = h_X(x_{(j)})$ and
$v_k = h_Y(y_{(k)})$, or in matrix notation

    $\psi(\theta) = Z \theta V^T \in \mathbb{R}^{J \times K}, \qquad Z = (z_1, \ldots, z_J)^T \in \mathbb{R}^{J \times K_X}, \qquad V = (v_1, \ldots, v_K)^T \in \mathbb{R}^{K \times K_Y}.$

Then (4.3) reduces to a log-linear model for the probabilities $p_{jk}$,

(4.4)  $\log p_{jk} = \alpha + \beta_j + \gamma_k + z_j^T \theta v_k$

induced by the covariates $z_j$, $v_k$, and results by Haberman [5] on the
existence and uniqueness of maximum likelihood estimates in log-linear models
apply. In particular the estimate $\hat p = (\hat p_{jk})$ is unique (if it
exists) and hence the estimate $\hat\theta$ is unique too, provided the
parameter $\theta$ is identifiable.
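
To make the log-linear representation (4.4) concrete, the sketch below builds a table $p_{jk}$ from hypothetical covariates and parameters and confirms that its log cross-product ratios reproduce $z_j^T\theta v_k$. Unlike the matrix notation above, the arrays `Z` and `V` here also carry the reference rows $z_0 = 0$ and $v_0 = 0$, an assumption made purely for convenience of indexing.

```python
import numpy as np

rng = np.random.default_rng(1)
J, K, KX, KY = 3, 2, 2, 1                      # table is (J+1) x (K+1)

# Covariates including the reference rows z_0 = 0 and v_0 = 0.
Z = np.vstack([np.zeros(KX), rng.normal(size=(J, KX))])
V = np.vstack([np.zeros(KY), rng.normal(size=(K, KY))])
theta = rng.normal(size=(KX, KY))              # hypothetical K_X x K_Y parameter
beta = np.concatenate([[0.0], rng.normal(size=J)])    # beta_0 = 0
gamma = np.concatenate([[0.0], rng.normal(size=K)])   # gamma_0 = 0

psi = Z @ theta @ V.T                          # psi_jk(theta) = z_j^T theta v_k
eta = beta[:, None] + gamma[None, :] + psi
alpha = -np.log(np.exp(eta).sum())             # normalising constant
log_p = alpha + eta                            # model (4.4)

# Log cross-product ratios with respect to the reference cell (0, 0).
log_cpr = log_p + log_p[0, 0] - log_p[:, [0]] - log_p[[0], :]
```

The main effects `beta`, `gamma` and the constant `alpha` drop out of `log_cpr`, so only the interaction term carries information about $\theta$.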

For sampling conditional on Y, the values $y_{(k)}$ should be chosen such
that the rank condition holds:

Condition (Rk). The $K_Y \times K$-matrix $V^T = (v_1, \ldots, v_K)$ has
rank $K_Y$.

This condition will be assumed whenever the log-bilinear association model
is used. Then a convenient reparametrization is available:

(4.5)  $\psi_{jk}(\theta) = z_j^T \tilde\theta_k, \qquad \tilde\theta_k = \theta v_k \in \mathbb{R}^{K_X}$

with a $K_X \times K$ parameter matrix
$\tilde\theta = (\tilde\theta_1, \ldots, \tilde\theta_K) = \theta V^T$. The
observed matrix Z of covariates will typically have rank $K_X$ and then
$\tilde\theta$, respectively, $\theta$ is uniquely determined by
$\psi(\theta) = Z\theta V^T = Z\tilde\theta$ and hence identifiable. In
general identifiability of $\theta$ is guaranteed by Condition (C3) in
Section 6.

5. Conditional likelihood. Although the maximum likelihood estimate
$\hat\theta$ of the association parameter $\theta$ may be obtained by
maximizing either of the two conditional likelihoods, the stochastic
properties of the latter depend on the sampling scheme. Let us now consider
sampling conditional on Y (which can be preferable from a practical point of
view, even for regression models) and derive properties of the "reverse"
likelihood $L_{Y|X}$. The advantage of $L_{Y|X}$ over the appropriate
likelihood $L_{X|Y}$ is that it usually has fewer nuisance parameters since K
is fixed by the sampling design whereas J will typically increase with the
number of observations, unless $\Omega_X$ is finite. An important example for
finite $\Omega_Y$ are case-control studies (called choice-based samples in
econometrics) for which asymptotic inference on $\theta$ in the (general)
logistic regression model may be obtained as if sampling had been
conditional on X; compare [12] and [14]. We want to extend these results
to arbitrary Y (e.g., vectors with continuous and/or discrete components)
and association models.

Instead of a data set $(x_i, y_i)$ we now consider the underlying random
elements. It is convenient to represent the sample as a compound vector
of random elements $\mathbf{X} = (X_{ki})$ indexed by $k = 0, \ldots, K$ and $i = 1, \ldots, n_k$.
Omitting now the parentheses in $y_{(k)}$ and $x_{(j)}$, each $X_{ki}$ is distributed as
$X_k \sim \mathcal{L}(X \mid Y = y_k)$. As above $n_{jk}$ denotes the frequency of $(x_j, y_k)$ in the
sample $(x_{ki}, y_k)$ and the empirical distribution on $\Omega_Y = \{y_0, \ldots, y_K\}$ is given
by the proportions $r_k = n_k/n$, where $n = n_+$ is the total sample size. Replacing
in $P$ the marginal distribution of $Y$ by the empirical distribution
(4.2) yields a joint distribution $P^*$ on $\Omega_X \times \Omega_Y$ given by the density $p^*$ with
respect to $\nu_X \times \nu_Y$:

    $p^*(x, y_k) = r_k \cdot p(x \mid Y = y_k)$  for all $x, k$.

The marginal density of $Y$ under $P^*$ is $p_Y^*(y_k) = r_k$ and the marginal,
respectively, conditional density for $X$ is

    $p_X^*(x) = \sum_{k=0}^{K} r_k \cdot p(x \mid Y = y_k)$,  respectively,

(5.1)  $p_k^*(x) := p^*(y_k \mid X = x) = \dfrac{r_k \cdot p(x \mid Y = y_k)}{p_X^*(x)}$.

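Equation (3.3) yields the parametrization $\log p_k^*(x) = \gamma_k^* + \psi_\theta(x, y_k) - \delta^*(x)$
with nuisance parameters $\gamma_k^* = \gamma^*(y_k)$ and $\delta^*(x) = \log[\sum_l \exp(\gamma_l^* + \psi_\theta(x, y_l))]$,
hence

(5.2)  $p_k^*(x) = \dfrac{\exp[\gamma_k^* + \psi_\theta(x, y_k)]}{\sum_l \exp[\gamma_l^* + \psi_\theta(x, y_l)]}$.

Choosing the reference value $y^\circ = y_0$ we have $\gamma_0^* = 0$, and the nuisance
parameter is $\gamma^* = (\gamma_1^*, \ldots, \gamma_K^*) \in \mathbb{R}^K$. Finally, the logarithm of the conditional
likelihood $L_{Y|X}$ may be written in terms of the compound parameter vector
$\lambda := (\theta, \gamma^*) \in \mathbb{R}^{S+K}$:

(5.3)  $\ell(\lambda) := \log L_{Y|X} = \sum_{k=0}^{K} \sum_{i=1}^{n_k} \log p_k^*(X_{ki})$  with

       $\log p_k^*(X_{ki}) = \gamma_k^* + \psi_\theta(X_{ki}, y_k) - \log\Big(\sum_{l=0}^{K} \exp(\gamma_l^* + \psi_\theta(X_{ki}, y_l))\Big)$.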
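As a concrete illustration of (5.3), the following sketch evaluates the reverse log-likelihood for the log-bilinear special case $\psi_\theta(x, y_k) = z^T\theta_k$ with $z = h_X(x)$. The function name `reverse_loglik` and the array layout are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def reverse_loglik(theta, gamma, z, k):
    """Log of L_{Y|X} as in (5.3), assuming psi_theta(x, y_k) = z^T theta_k.

    theta : (d, K) array, column k-1 holds theta_k (theta_0 = 0 by convention)
    gamma : (K,) array of nuisance parameters gamma*_1..gamma*_K (gamma*_0 = 0)
    z     : (n, d) array of covariate vectors z_i = h_X(x_i) (hypothetical layout)
    k     : (n,) integer array with the class index of y_i, values in 0..K
    """
    theta = np.asarray(theta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    z = np.asarray(z, dtype=float)
    k = np.asarray(k, dtype=int)
    # Linear predictors gamma*_l + z^T theta_l; column 0 is the reference class.
    eta = np.column_stack([np.zeros(len(z)), gamma + z @ theta])
    # Numerically stable log of the normalising sum over l = 0..K.
    m = eta.max(axis=1)
    log_norm = m + np.log(np.exp(eta - m[:, None]).sum(axis=1))
    return float(np.sum(eta[np.arange(len(z)), k] - log_norm))
```

With all parameters set to zero every class has probability $1/(K+1)$, so the function returns $-n\log(K+1)$, which gives a quick sanity check of an implementation.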

Notice that $\ell(\lambda)$ is the log-likelihood of the multivariate logistic regression
model

(5.4)  $\operatorname{logit} p_k^*(x) = \gamma_k^* + \psi_\theta(x, y_k)$,  $k = 1, \ldots, K$,

which is nonlinear in general. The estimate $\hat\lambda$ maximizing $\ell(\lambda)$ satisfies

(5.5)  $D_\lambda \ell(\hat\lambda) = \sum_{k=0}^{K} \sum_{i=1}^{n_k} D_\lambda \log p_k^*(X_{ki}) = 0$,

where $D_\lambda$ denotes the differential operator with respect to $\lambda$. The basic
stochastic properties of the solution of the estimating equation (5.5) depend
on the moments of the estimating function $D_\lambda \ell(\lambda)$ and its derivative. The
first important property (proved in Appendix A.2) is that its expectation
is zero, which is not obvious since $\ell(\lambda)$ is not the log-likelihood for the
underlying sampling:

(5.6)  $E[D_\lambda \ell(\lambda)] = \sum_k n_k \cdot E[D_\lambda \log p_k^*(X_k)] = 0$.

Next, the components of the covariance matrix $\Sigma(\lambda) := \mathrm{Cov}(D_\lambda \ell(\lambda))$ are
given by

(5.7)  $\Sigma_{st}(\lambda) = \sum_k n_k \cdot \mathrm{Cov}(D_{\lambda_s} \log p_k^*(X_k), D_{\lambda_t} \log p_k^*(X_k))$

and for the partial second derivatives we get

(5.8)  $J_{st}(\lambda) := -D^2_{\lambda_s \lambda_t} \ell(\lambda) = -\sum_k \sum_i D^2_{\lambda_s \lambda_t} \log p_k^*(X_{ki})$

with expectation (cf. Appendix A.2)

(5.9)  $I_{st}(\lambda) := E(J_{st}(\lambda)) = \sum_k n_k \cdot E(D_{\lambda_s} \log p_k^*(X_k) \cdot D_{\lambda_t} \log p_k^*(X_k))$.

Since $\ell(\lambda)$ is not the log-likelihood for sampling conditional on $X$, the
matrices $\Sigma(\lambda)$ and $I(\lambda)$ need not be equal, but from (5.7) their difference is

(5.10)  $I_{st}(\lambda) - \Sigma_{st}(\lambda) = \sum_k n_k \cdot E(D_{\lambda_s} \log p_k^*(X_k)) \cdot E(D_{\lambda_t} \log p_k^*(X_k))$.

From now on we assume the essential:

Condition (R2). $\Sigma(\lambda) = \mathrm{Cov}(D_\lambda \ell(\lambda))$ is positive definite for all $\lambda$.

Two equivalent formulations (cf. Appendix A.2) are

Condition (R2'). $I(\lambda)$ is positive definite for all $\lambda$.

Condition (R2''). For all $\theta$, all $s \in \mathbb{R}^S$ and $c_1, \ldots, c_K \in \mathbb{R}$:
$D_\theta \psi_\theta(X, y_k) \cdot s = c_k$ for $k = 1, \ldots, K$ almost surely $\Rightarrow$ $s = 0$.

In the last formulation (which does not include the nuisance parameter
$\gamma^*$) we can replace $X$ by $X_k$, since their distributions belong to $\mathcal{P}$ and
hence dominate each other.

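Using the block notation for an $(S + K) \times (S + K)$ matrix, say

    $A = \begin{pmatrix} A_{\theta\theta} & A_{\theta\gamma} \\ A_{\gamma\theta} & A_{\gamma\gamma} \end{pmatrix}$,

a fundamental result can be derived (cf. Appendix A.2) by adopting the
method in [12].

Theorem 3. For any $\lambda$:

(a)  $I(\lambda) - \Sigma(\lambda) = I(\lambda) \cdot \begin{pmatrix} 0 & 0 \\ 0 & W \end{pmatrix} \cdot I(\lambda)$,

where the $K \times K$ matrix $W$ is the sum of the diagonal matrix $\mathrm{diag}(n_1^{-1}, \ldots, n_K^{-1})$
and the constant matrix $(n_0^{-1})$, that is, $W_{kl} = \Delta_{kl} n_k^{-1} + n_0^{-1}$ with
Kronecker's $\Delta$.

(b)  $[I^{-1}(\lambda)]_{\theta\theta} = [I^{-1}(\lambda) \cdot \Sigma(\lambda) \cdot I^{-1}(\lambda)]_{\theta\theta}$.

The matrix in (b) will later turn out to be the asymptotic covariance
matrix of the estimate $\hat\theta$.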
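Theorem 3(a) can be checked numerically on a toy example. The sketch below is an illustration under assumptions that are not in the paper: $X$ has a finite support with scalar covariate values `z[j]`, the conditional laws $p(x \mid Y = y_k)$ are exponential tilts of hypothetical baseline weights `q[j]`, and the model is log-bilinear with $\psi_\theta(x, y_k) = z\,\theta_k$, $\theta_0 = 0$. All expectations are then finite sums, so $I(\lambda)$ from (5.9) and $\Sigma(\lambda)$ from (5.7) can be computed exactly.

```python
import numpy as np

def theorem3_matrices(z, q, theta, n):
    """Exact I(lambda), Sigma(lambda) and the block matrix diag(0, W) for a toy model.

    z, q  : support points and baseline weights of X (hypothetical inputs)
    theta : (theta_1, ..., theta_K), with theta_0 = 0 implied
    n     : subsample sizes (n_0, ..., n_K)
    """
    z = np.asarray(z, dtype=float)
    q = np.asarray(q, dtype=float)
    n = np.asarray(n, dtype=float)
    theta = np.concatenate([[0.0], np.asarray(theta, dtype=float)])
    K = len(n) - 1

    # p(x_j | Y = y_k), shape (K+1, J): exponential tilting of the baseline q.
    cond = q[None, :] * np.exp(np.outer(theta, z))
    cond /= cond.sum(axis=1, keepdims=True)

    # p*_k(x_j) from (5.1) with r_k = n_k / n.
    joint = (n / n.sum())[:, None] * cond
    pstar = joint / joint.sum(axis=0, keepdims=True)

    # Score D_lambda log p*_k(x_j), lambda = (theta_1..theta_K, gamma_1..gamma_K).
    indicator = np.eye(K + 1)[:, 1:]                          # Delta_{km}, m = 1..K
    resid = indicator[:, None, :] - pstar[1:].T[None, :, :]   # shape (K+1, J, K)
    score = np.concatenate([z[None, :, None] * resid, resid], axis=2)

    mean = np.einsum('kj,kjs->ks', cond, score)               # E[score] in class k
    info = np.einsum('k,kj,kjs,kjt->st', n, cond, score, score)   # I(lambda), (5.9)
    sigma = info - np.einsum('k,ks,kt->st', n, mean, mean)        # sum_k n_k Cov, (5.7)

    W = np.diag(1.0 / n[1:]) + 1.0 / n[0]
    block = np.zeros((2 * K, 2 * K))
    block[K:, K:] = W
    return info, sigma, block, n @ mean   # last entry: E[D_lambda l(lambda)], cf. (5.6)
```

For example, with `info, sigma, block, total = theorem3_matrices([-1, 0, 1, 2], [0.1, 0.4, 0.3, 0.2], [0.5, -0.7], [30, 50, 20])`, the difference `info - sigma` should agree with `info @ block @ info` up to rounding, and `total` should vanish as stated in (5.6).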

Log-bilinear association: Using (4.5) and 9 (instead of 9) the model states 

(5.11)  $\psi_\theta(x, y_k) = z^T \theta_k$  with $z = h_X(x)$, $\theta = (\theta_1, \ldots, \theta_K) \in \mathbb{R}^{K_X \times K}$

and is equivalent to the linear logistic regression model given by (5.2), that
is,

    $\operatorname{logit} p_k^*(x) = \gamma_k^* + z^T \theta_k$,  $k = 1, \ldots, K$.

Condition (R2'') holds if $h_X(X)$ is not concentrated on a hyperplane of $\mathbb{R}^{K_X}$,
that is, if the following condition is met (cf. Appendix A.2):

Condition (R2)$_{LBA}$. For all $s \in \mathbb{R}^{K_X}$: $s^T h_X(X)$ is constant almost
surely $\Rightarrow$ $s = 0$.
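An empirical analogue of Condition (R2)$_{LBA}$ is easy to check on data: no nonzero $s$ makes $s^T z_i$ constant over the sample exactly when the matrix with rows $(1, z_i^T)$ has full column rank. The helper below is a sketch of that check; the name `satisfies_r2_lba` is hypothetical, and a sample-based check can of course only suggest, not prove, the population condition.

```python
import numpy as np

def satisfies_r2_lba(Z, tol=None):
    """Sample version of Condition (R2)_LBA.

    Z : (n, d) array whose rows are the observed vectors z_i = h_X(x_i).
    Returns True when the design matrix [1, Z] has full column rank d + 1,
    i.e. the observed z_i do not all lie on a common hyperplane.
    """
    Z = np.asarray(Z, dtype=float)
    design = np.column_stack([np.ones(len(Z)), Z])
    return bool(np.linalg.matrix_rank(design, tol=tol) == design.shape[1])
```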

6. Asymptotics and consistency. We now turn to the asymptotic properties
of the estimate $\hat\lambda = (\hat\theta, \hat\gamma^*)$ in the model (3.13). Our asymptotic approach
assumes that the set $\Omega_Y = \{y_0, \ldots, y_K\}$ of conditional values will remain
fixed while all subsample sizes $n_k$ tend to infinity with fixed ratios
$r_k = n_k/n > 0$ for all $n$ and $k$. Hence the nuisance parameter $\gamma^*$, the
distribution $P^*$ and its conditional densities $p_k^*(x)$ do not vary with $n$. The
true parameter will now be denoted by $\lambda^\circ = (\theta^\circ, \gamma^\circ)$ instead of $\lambda$ and the
notation $E$, $P$, etc. now refer to expectations, probabilities, etc. with respect
to $\lambda^\circ$. The conditional log-likelihood $\ell^{(n)}(\lambda)$ (the additional index $n$ is
supplied if necessary) need not have a unique maximizing argument $\hat\lambda$ for every
sample. Concerning uniqueness, the strong law of large numbers yields for
the matrix $J^{(n)}(\lambda) = -D^2_{\lambda\lambda} \ell^{(n)}(\lambda)$ from (5.8)

(6.1)  $\frac{1}{n} J^{(n)}(\lambda) \xrightarrow[n \to \infty]{} \bar{I}(\lambda) := \sum_{k=0}^{K} r_k \cdot E(-D^2_{\lambda\lambda} \log p_k^*(X_k))$  almost surely.

The matrix $\bar{I}(\lambda) = \frac{1}{n} I(\lambda)$ is positive definite by Condition (R2') which
implies $D^2_{\lambda\lambda} \ell^{(n)}(\lambda) = -J^{(n)}(\lambda)$ is negative definite for almost all (i.e., all
except finitely many) $n$, almost surely. Hence, almost surely, the function
$\ell^{(n)}(\lambda)$ is strictly concave for almost all $n$, which implies that $D_\lambda \ell^{(n)}(\lambda) = 0$
has at most one solution $\hat\lambda$, which also maximizes $\ell^{(n)}(\lambda)$. Since the unique
existence of a maximizing argument $\hat\lambda$ of $\ell^{(n)}(\lambda)$ is not guaranteed for every
$n$, we consider any sequence of (measurable) functions $\hat\lambda^{(n)}$ as estimators if
the estimating condition is met:

Condition (C1). If $\ell^{(n)}(\lambda)$ has a maximizing argument $\hat\lambda$, then
$\ell^{(n)}(\hat\lambda^{(n)}) = \mathrm{Max}_\lambda\, \ell^{(n)}(\lambda)$.

To establish the consistency of such a sequence $\hat\lambda^{(n)}$ we assume an
integrability and an identifiability condition:

Condition (C2). $E\{\psi_X(X_k)\} < \infty$ for all $k = 0, \ldots, K$.

Condition (C3). $\psi_{\theta_1}(X, y_k) = \psi_{\theta_2}(X, y_k)$ for $k = 1, \ldots, K$ almost surely
$\Rightarrow$ $\theta_1 = \theta_2$.

As in Condition (R2''), we can equivalently replace $X$ by $X_k$ in Condition
(C3). In Appendix A.3 we derive the asymptotic (unique) existence and
strong consistency of the estimator:

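Theorem 4 (Consistency). Under Conditions (C1)-(C3) the following
properties hold almost surely:

(a) For almost all $n$ there exists a unique $\hat\lambda$ maximizing $\ell^{(n)}(\lambda)$, namely
$\hat\lambda^{(n)}$.

(b) For almost all $n$ there exists a unique solution $\hat\lambda$ of $D_\lambda \ell^{(n)}(\lambda) = 0$,
namely $\hat\lambda^{(n)}$.

(c) $\hat\lambda^{(n)} = (\hat\theta^{(n)}, \hat\gamma^{*(n)}) \xrightarrow[n \to \infty]{} \lambda^\circ = (\theta^\circ, \gamma^\circ)$.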

Log-bilinear association: In view of $\psi_X(x) = \|h_X(x)\|^2$, Condition (C2)
reduces to a moment condition for $Z_k = h_X(X_k)$:

Condition (C2)$_{LBA}$. $E\{\|Z_k\|^2\} < \infty$ for all $k = 0, \ldots, K$.

And, using the parametrization (5.11), Condition (C3) reduces to

    $h_X^T(X)\,\theta_{1k} = h_X^T(X)\,\theta_{2k}$ for $k = 1, \ldots, K$ almost surely $\Rightarrow$ $\theta_{1k} = \theta_{2k}$ for all $k$,

which is implied by the stronger Condition (R2)$_{LBA}$.

7. Asymptotic normality. Let us finally establish the asymptotic normality
for a sequence $\hat\lambda^{(n)}$ of estimates. Instead of assuming Condition (C1), we
derive the asymptotic distribution for any weakly consistent sequence $\hat\lambda^{(n)}$
solving the estimating equation at least approximately, that is, we only
assume

Condition (N1). $D_\lambda \ell^{(n)}(\hat\lambda^{(n)}) = o_P(\sqrt{n})$, respectively, $n^{-1/2} \cdot D_\lambda \ell^{(n)}(\hat\lambda^{(n)}) \xrightarrow[n \to \infty]{P} 0$.

Condition (N2). $\hat\lambda^{(n)} \xrightarrow[n \to \infty]{P} \lambda^\circ$.

Obviously both conditions hold under the assumptions of Theorem 4.
Furthermore we assume the following consistency results, which are derived
later (Theorem 6) from Condition (N2) and additional moment conditions:

Condition (N3). $\frac{1}{n} \int_0^1 J^{(n)}(\lambda^\circ + t[\hat\lambda^{(n)} - \lambda^\circ])\, dt \xrightarrow[n \to \infty]{P} \bar{I}(\lambda^\circ)$.

Condition (N4). $\frac{1}{n} J^{(n)}(\hat\lambda^{(n)}) \xrightarrow[n \to \infty]{P} \bar{I}(\lambda^\circ)$.

In Appendix A.4 we derive the asymptotic normality of the estimate as follows,
where $A^{-}$, respectively, $A^{1/2}$ denotes the generalized Moore-Penrose
inverse, respectively, the symmetric root of a positive semidefinite matrix $A$,
and $\mathbb{I}$ is the identity matrix.

Theorem 5 (Normality). Any sequence $\hat\lambda^{(n)}$ of estimators with Conditions
(N1)-(N3) is asymptotically normal:

(a) $\sqrt{n}\,[\hat\lambda^{(n)} - \lambda^\circ] \xrightarrow[n \to \infty]{\mathcal{L}} N(0, \bar{I}^{-1}(\lambda^\circ) \cdot \bar\Sigma(\lambda^\circ) \cdot \bar{I}^{-1}(\lambda^\circ))$ with $\bar\Sigma(\lambda) := \sum_k r_k \cdot \mathrm{Cov}(D_\lambda \log p_k^*(X_k))$,

(b) $\sqrt{n}\,[\hat\theta^{(n)} - \theta^\circ] \xrightarrow[n \to \infty]{\mathcal{L}} N(0, [\bar{I}^{-1}(\lambda^\circ)]_{\theta\theta})$.

Corollary. If in addition Condition (N4) holds, then

(c) $\big([J^{(n)}(\hat\lambda^{(n)})^{-}]_{\theta\theta}^{1/2}\big)^{-} \cdot [\hat\theta^{(n)} - \theta^\circ] \xrightarrow[n \to \infty]{\mathcal{L}} N(0, \mathbb{I})$.

Less formally (a) and (b) state

    $\hat\lambda \sim N(\lambda^\circ, I^{-1}(\lambda^\circ) \cdot \Sigma(\lambda^\circ) \cdot I^{-1}(\lambda^\circ))$,  $\hat\theta \sim N(\theta^\circ, [I^{-1}(\lambda^\circ)]_{\theta\theta})$.

$J(\hat\lambda)$ is a consistent estimate of $I(\lambda^\circ)$ by Condition (N4), and will be positive
definite for almost all $n$ (almost surely) by (6.1). In this case, (c) states

(7.1)  $\hat\theta \sim N(\theta^\circ, [J^{-1}(\hat\lambda)]_{\theta\theta})$.

Notice that for an observed data set, the estimated covariance matrix
$[J^{-1}(\hat\lambda)]_{\theta\theta}$ (where the random variables are replaced by observations) is
identical to the corresponding matrix under sampling conditional on $X$
(instead of $Y$). In this sense the estimate $\hat\theta$ and its estimated asymptotic
normal distribution are invariant under sampling conditional on either $Y$ or $X$.
Hence asymptotic inference (i.e., tests or confidence regions) for the association
parameter $\theta$ based on the asymptotic distribution (7.1) of the estimate
$\hat\theta$ is invariant under both conditional sampling schemes, too.
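In practice the estimated covariance in (7.1) is the $\theta\theta$-block of the inverse observed information. A minimal sketch of extracting that block is given below; the function name `theta_covariance` and the parameter ordering (the first `S` coordinates belong to $\theta$, the rest to $\gamma^*$) are assumptions of this example. It uses the Schur complement, a standard identity for the leading block of a partitioned inverse, so the nuisance block is solved against rather than inverted explicitly.

```python
import numpy as np

def theta_covariance(J, S):
    """Return the theta-theta block of J^{-1} for an (S+K) x (S+K) matrix J.

    Assumes J and its gamma-gamma block are invertible and that the first S
    rows/columns correspond to theta (ordering is an assumption of this sketch).
    """
    J = np.asarray(J, dtype=float)
    J_tt, J_tg = J[:S, :S], J[:S, S:]
    J_gt, J_gg = J[S:, :S], J[S:, S:]
    # Schur complement of the gamma-gamma block; its inverse is [J^{-1}]_{theta theta}.
    schur = J_tt - J_tg @ np.linalg.solve(J_gg, J_gt)
    return np.linalg.inv(schur)
```

Standard errors for the components of $\hat\theta$ would then be the square roots of the diagonal of the returned matrix.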

The above Conditions (N3) and (N4) will now be derived from the consistency
Condition (N2) and additional properties of the function $G$. For

    $H_r(z \mid \theta) = \sum_{k=0}^{K} |D_{\theta_r} G(z, h_Y(y_k), \theta)|$,

    $H_{rs}(z \mid \theta) = \sum_{k=0}^{K} |D^2_{\theta_r \theta_s} G(z, h_Y(y_k), \theta)|$,

    $H_{rst}(z \mid \theta) = \sum_{k=0}^{K} |D^3_{\theta_r \theta_s \theta_t} G(z, h_Y(y_k), \theta)|$,

the following result is proved in Appendix A.4.

Theorem 6. Conditions (N3) and (N4) follow from (N2) and the moment
condition (MC):

Condition (MC). There exists $\varepsilon^\circ > 0$ such that for $B(\theta^\circ) = \{\theta \mid \|\theta - \theta^\circ\| < \varepsilon^\circ\}$
and all $k = 0, \ldots, K$ the following functions of $Z_k = h_X(X_k)$:

    $\sup_{\theta \in B(\theta^\circ)} H_r(Z_k \mid \theta)^3$,  $\sup_{\theta \in B(\theta^\circ)} H_{st}(Z_k \mid \theta)^2$,  $\sup_{\theta \in B(\theta^\circ)} H_{rst}(Z_k \mid \theta)$

have finite expectation for all $r, s, t = 1, \ldots, S$.

Hence the requirements for Theorem 5 are met if Conditions (MC) and
(C1)-(C3) in Theorem 4 hold.

Log-bilinear association: The log-bilinear association model is based on
the function $G(z, v, \theta) = z^T \theta v$ with partial derivatives $D_{\theta_{lm}} G(z, v, \theta) = z_l v_m$
and vanishing higher derivatives. Hence Condition (MC) holds if Condition
(C2)$_{LBA}$ is strengthened to

Condition (MC)$_{LBA}$. $E\{\|Z_k\|^3\} < \infty$ for all $k = 0, \ldots, K$.

8. Discussion. Association models for a pair of random elements $(X, Y)$
do not restrict the marginal distributions of $X$ and $Y$ but only their odds
ratio function. We have looked at parametric association models which
include the important log-bilinear association models. An advantage of these
models is that inference about the odds ratio (or association) parameter
vector may be obtained from sampling $Y$ conditional on fixed values of $X$
or vice versa. The maximum likelihood estimate $\hat\theta$ is the same under both
conditional sampling schemes, and asymptotic inference concerning $\theta$ is
invariant with respect to sampling, too. More precisely, we have shown that
for samples conditional on $Y$, the estimate $\hat\theta$ maximizing the "reverse"
conditional likelihood $L_{Y|X}$ is consistent and asymptotically normal, and its estimated
asymptotic covariance matrix is the same as if sampling had been conditional
on $X$. These results have been obtained much earlier for discrete $Y$
with finite range for the multivariate linear logistic regression model in [12]
and for the general logistic regression model in [16] (for $X$ with finite range)
and [14]. Our result allows both $X$ and $Y$ to be arbitrary random vectors
each having discrete and/or continuous components.

Furthermore, asymptotic inference for the regression parameters $\beta$ in
widely used regression models is available when sampling is conditional on $Y$
(instead of $X$). For example, in log-linear regression models for Poisson
variates we have $\beta = \theta$ and hence inference on $\beta$ may also be obtained from
samples conditional on $Y$. Even in the linear regression model $\mu(x) = \alpha + z^T \beta$
with covariate vector $z = h_X(x)$ and $\mathcal{L}(Y \mid x) = N(\mu(x), \sigma^2)$, asymptotic
inference for $\theta = \sigma^{-2} \beta$ may be obtained from samples conditional on $Y$,
including tests of a linear hypothesis $C\theta = 0$, which is equivalent to $C\beta = 0$.
However, confidence regions are only available for $\theta$, but not for $\beta$, unless
an estimate of $\sigma^2$ from another sample is at hand. This extends to the
multivariate case where the conditional distribution of $Y$ is multivariate normal
$N_K(\mu(x), \Sigma)$ and the odds ratio parameter is given by $\theta = \beta \Sigma^{-1}$. Although
sampling conditional on $Y$ seems unnatural for a regression model, it may
be very attractive if such a sample is much easier (e.g., cheaper or quicker)
to obtain. The advantages of (retrospective) case-control over (prospective)
cohort studies can thus be extended to an arbitrary response vector $Y$, for
example, to infinite discrete response categories or to a continuous response
$Y$. In the latter case we do not get confidence intervals for $\beta$, but tests
for linear hypotheses (which may be of primary interest, e.g., in a clinical
trial) are available.
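The relation between the regression slope and the odds ratio parameter in the normal linear model can be verified directly from the densities. The sketch below assumes a scalar covariate $z = x$ and an arbitrary reference point $(x_0, y_0)$; the function names are hypothetical. It computes the log odds ratio from the conditional normal densities, which should equal $(\beta/\sigma^2)(x - x_0)(y - y_0)$, the log-bilinear form with $\theta = \sigma^{-2}\beta$.

```python
import math

def normal_log_density(y, mean, sigma2):
    """Log density of N(mean, sigma2) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi * sigma2) - (y - mean) ** 2 / (2.0 * sigma2)

def log_odds_ratio(x, y, x0, y0, alpha, beta, sigma2):
    """Log odds ratio of (x, y) against the reference (x0, y0)
    when Y given x is N(alpha + beta * x, sigma2)."""
    def logp(yy, xx):
        return normal_log_density(yy, alpha + beta * xx, sigma2)
    # log p(y|x) + log p(y0|x0) - log p(y|x0) - log p(y0|x)
    return logp(y, x) + logp(y0, x0) - logp(y, x0) - logp(y0, x)
```

Note that the intercept $\alpha$ cancels in the odds ratio, and that only the ratio $\beta/\sigma^2$ enters, which is why $\beta$ itself is not identified from samples conditional on $Y$ without a separate estimate of $\sigma^2$.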

Related, but different, semiparametric models for random vectors $X = (X_1, \ldots, X_I)$
and $Y = (Y_1, \ldots, Y_J)$ are given by multivariate copulas which
specify parametric distributions on $[0, 1]^{I+J}$ with uniform marginals. However,
a copula is not an association model in our sense (cf. [9]) because a
copula only leaves the marginal distributions of all univariate components
$X_i$ and $Y_j$ arbitrary, but the marginal distributions of the vectors $X$,
respectively, $Y$ are restricted through the parametrization of the copula, unless
both $X$ and $Y$ are univariate. And even in the latter case, the odds ratio
function $OR(X, Y)$ cannot be recovered from the corresponding copula unless
both marginal distributions of $X$ and $Y$ are known. Hence the rather
general semiparametric association models considered here do not fit in the
framework of copulas.

APPENDIX: PROOFS 

A.1. Proof of Theorem 2 (existence). We have already seen that (b)
implies (c) and it remains to derive (a) from (c), which uses the concept of
an $I$-projection and heavily relies on results by Csiszár [2] and Rüschendorf
and Thomsen [13]. Setting $\pi = \pi_X \times \pi_Y$ we first conclude from Condition
(E2) the existence of $R \in \mathcal{P}$ with $\pi$-density

    $r = \exp(\psi - \beta - \gamma - \alpha) > 0$,  $\alpha = \log \int \exp(\psi - \beta - \gamma)\, d\pi$

and the wanted $P$ will be the $I$-projection of $R$ on $\mathcal{E} = \{P \in \mathcal{P} \mid P^X = \pi_X, P^Y = \pi_Y\}$.
The integrability of $\psi$, $\beta$ and $\gamma$ implies

    $I(\pi \mid R) = \int \log \frac{1}{r}\, d\pi = \int (\alpha + \beta + \gamma - \psi)\, d\pi < \infty$

and since $\pi \in \mathcal{E}$, we conclude from Theorem 2.1 in [2] that $R$ has an
$I$-projection $P$ on $\mathcal{E}$. Application of Theorem 3.1 in [2] to the set

    $\mathcal{F} = \{f_X + f_Y \mid f_X \in \mathcal{L}_1(\pi_X), f_Y \in \mathcal{L}_1(\pi_Y)\} \subset \mathcal{L}_1(P)$

yields that the $R$-density $p_R$ of $P$ satisfies $p_R = \exp(h)$ $\pi$-almost surely,
where $h$ belongs to the closure $\bar{\mathcal{F}}$ of $\mathcal{F}$ in $\mathcal{L}_1(P)$. Rüschendorf and Thomsen
[13] pointed out that $\mathcal{F}$ need not be closed in $\mathcal{L}_1(P)$, which was claimed in
the proof of Corollary 3.1, case (B) in Csiszár [2].

Now R<^ir implies that exp(/i) > is an i?-density of P and hence R 
P <C R- Furthermore r > yields R <C vr <C R and hence P G 7- > <, since 
P XY = 7T. From Theorem 2.2 in [2] we obtain 

/(vr I P) + I(P I R) < /(tt \R)<oo, 

which establishes P G Vj. Finally OR(P) = ip remains to be shown. From 

$P \ll P^X \times P^Y$ and Proposition 2 in [13] we conclude the existence of measurable
functions $b: \Omega_X \to \mathbb{R}$ and $c: \Omega_Y \to \mathbb{R}$, such that $h(x, y) = b(x) + c(y)$
$P$-almost surely, and hence $R$-almost surely. Hence a $\pi$-density of $P$ is given
by

    $\dfrac{dP}{d\pi} = \dfrac{dP}{dR} \cdot \dfrac{dR}{d\pi} = \exp(b + c) \cdot r = \exp(b + c - \beta - \gamma - \alpha + \psi)$

and a direct calculation yields $\log OR(P) = \psi$ as required.

A.2. Proof of the results in Section 5. We start with some preliminary
results. The derivatives of $\log p_k^*$ are given by

(A.1)  $D_{\lambda_s} \log p_k^*(x) = \dfrac{D_{\lambda_s} p_k^*(x)}{p_k^*(x)}$,

       $D^2_{\lambda_s \lambda_t} \log p_k^*(x) = \dfrac{D^2_{\lambda_s \lambda_t} p_k^*(x)}{p_k^*(x)} - D_{\lambda_s} \log p_k^*(x) \cdot D_{\lambda_t} \log p_k^*(x)$.

For any set of measurable functions $G_k(x)$ we obtain from (5.1) a key equality:

(A.2)  $\sum_k r_k \cdot E(G_k(X_k)) = \sum_k r_k \cdot E(G_k(X) \mid Y = y_k) = \int \sum_k G_k(x) \cdot p_k^*(x) \cdot p_X^*(x)\, d\nu_X(x) = E^*\Big\{\sum_k G_k(X) \cdot p_k^*(X)\Big\}$,

where $E^*$ denotes expectation with respect to $P^*$.

In particular, we get for $G_k(x) = H(x) \cdot D_\lambda \log p_k^*(x)$ and any measurable
$H(x)$

(A.3)  $\sum_k r_k \cdot E[H(X_k) \cdot D_\lambda \log p_k^*(X_k)] = E^*\Big\{\sum_k H(X) \cdot D_\lambda \log p_k^*(X) \cdot p_k^*(X)\Big\} = E^*\Big\{H(X) \cdot \sum_k D_\lambda p_k^*(X)\Big\} = 0$,

since $p_+^*(x) = 1$. In particular, (5.6) follows for $H(x) = 1$.

Proof of (5.9). Choosing $G_k(X_k) = p_k^*(X_k)^{-1} \cdot D^2_{\lambda_s \lambda_t} p_k^*(X_k)$ in (A.2)
yields

    $\sum_k r_k \cdot E[p_k^*(X_k)^{-1} \cdot D^2_{\lambda_s \lambda_t} p_k^*(X_k)] = E^*\Big\{\sum_k D^2_{\lambda_s \lambda_t} p_k^*(X)\Big\} = 0$

and (5.9) follows using (A.1):

    $E(J_{st}(\lambda)) = n \cdot \sum_k r_k \cdot E(D_{\lambda_s} \log p_k^*(X_k) \cdot D_{\lambda_t} \log p_k^*(X_k))$.  □

Proof of Conditions (R2) $\Leftrightarrow$ (R2'). By (5.10) $I(\lambda)$ is a sum of
$\Sigma(\lambda)$ and a positive semidefinite matrix. Hence $I(\lambda)$ is positive semidefinite,
and even positive definite, provided Condition (R2) holds. Conversely,
let Condition (R2') hold. Then $t^T \Sigma(\lambda) t = \mathrm{Var}(t^T D_\lambda \ell(\lambda)^T) = 0$ implies that
$t^T D_\lambda \ell(\lambda)^T$ is constant almost surely, and hence $t^T D^2_{\lambda\lambda} \ell(\lambda) = D_\lambda [t^T D_\lambda \ell(\lambda)^T] = 0$
almost surely. Thus $t^T I(\lambda) = -E(t^T D^2_{\lambda\lambda} \ell(\lambda)) = 0$, which implies $t = 0$ by
Condition (R2'). Hence Condition (R2) holds. □

Proof of Conditions (R2') $\Leftrightarrow$ (R2''). $I(\lambda)$ is positive semidefinite
(as already observed) and hence Condition (R2') is equivalent to

(A.4)  For all $t \in \mathbb{R}^{S+K}$: $t^T I(\lambda) t = 0 \Rightarrow t = 0$.

For any $t \in \mathbb{R}^{S+K}$ we get from (5.9)

    $t^T I(\lambda) t = \sum_k n_k \cdot E(\|D_\lambda \log p_k^*(X_k) \cdot t\|^2)$

and since the distributions of $X_k$ and $X$ dominate each other:

(A.5)  $t^T I(\lambda) t = 0 \Leftrightarrow D_\lambda \log p_k^*(X) \cdot t = 0$ for $k = 0, \ldots, K$ almost surely.

To derive Condition (R2') from Condition (R2''), let $t^T I(\lambda) t = 0$. From (5.4)
we get

(A.6)  $\operatorname{logit} p_k^*(X) = \log p_k^*(X) - \log p_0^*(X) = \gamma_k^* + \psi_\theta(X, y_k)$

and for $t = (s, -c)$ with $s \in \mathbb{R}^S$, $c = (c_1, \ldots, c_K)$, we obtain from (A.5) almost
surely

(A.7)  $0 = D_\lambda \operatorname{logit} p_k^*(X) \cdot t = D_\theta \operatorname{logit} p_k^*(X) \cdot s - D_{\gamma^*} \operatorname{logit} p_k^*(X) \cdot c = D_\theta \psi_\theta(X, y_k) \cdot s - c_k$  for all $k = 1, \ldots, K$.

And from Condition (R2'') we conclude $s = 0$ as well as $c_k = 0$ for all $k$, and
thus $t = 0$.

Conversely, suppose Condition (R2') holds. To establish Condition (R2''),
it suffices to show that (A.7) implies $s = 0$. From (5.2) and (5.4) we get

    $p_0^*(X)^{-1} = \sum_l \exp[\operatorname{logit} p_l^*(X)]$,

    $D_\lambda \log p_0^*(X) \cdot t = -p_0^*(X) \sum_l \exp[\operatorname{logit} p_l^*(X)] \cdot D_\lambda \operatorname{logit} p_l^*(X) \cdot t$.

Hence (A.7), and $\operatorname{logit} p_0^* = 0$, imply $D_\lambda \log p_0^*(X) \cdot t = 0$ almost surely.
From (A.6) we get $D_\lambda \log p_k^*(X) \cdot t = 0$ for $k = 0, \ldots, K$ almost surely, and
(A.5), (A.4) establish $t = 0$ and hence $s = 0$. □

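Proof of Theorem 3. Part (a) is equivalent to three equations:

    (a)$_{\theta\theta}$  $I_{\theta\theta} - \Sigma_{\theta\theta} = I_{\theta\gamma} \cdot W \cdot I_{\gamma\theta}$,

    (a)$_{\theta\gamma}$  $I_{\theta\gamma} - \Sigma_{\theta\gamma} = I_{\theta\gamma} \cdot W \cdot I_{\gamma\gamma}$,

    (a)$_{\gamma\gamma}$  $I_{\gamma\gamma} - \Sigma_{\gamma\gamma} = I_{\gamma\gamma} \cdot W \cdot I_{\gamma\gamma}$.

Some prerequisite results are derived first using the notation

    $b_{sk} = E[D_{\theta_s} \log p_k^*(X_k)] \in \mathbb{R}$,  $b_k = (b_{1k}, \ldots, b_{Sk}) \in \mathbb{R}^S$,

    $c_{mk} = E[D_{\gamma_m^*} \log p_k^*(X_k)] \in \mathbb{R}$,  $c_k = (c_{1k}, \ldots, c_{Kk}) \in \mathbb{R}^K$,

    $B = (b_1, \ldots, b_K) \in \mathbb{R}^{S \times K}$,  $\tilde{B} = (b_0, \ldots, b_K) \in \mathbb{R}^{S \times (K+1)}$,

    $C = (c_1, \ldots, c_K) \in \mathbb{R}^{K \times K}$,  $\tilde{C} = (c_0, \ldots, c_K) \in \mathbb{R}^{K \times (K+1)}$,

    $N = \mathrm{diag}(n_1, \ldots, n_K) \in \mathbb{R}^{K \times K}$,  $\tilde{N} = \mathrm{diag}(n_0, \ldots, n_K) \in \mathbb{R}^{(K+1) \times (K+1)}$.

From (5.3) we obtain the partial derivatives

    $D_{\theta_s} \log p_k^*(x) = D_{\theta_s} \psi_\theta(x, y_k) - \sum_l p_l^*(x) \cdot D_{\theta_s} \psi_\theta(x, y_l)$,

    $D_{\gamma_m^*} \log p_k^*(x) = \Delta_{km} - p_m^*(x)$

and (5.9) yields

    $I_{\theta_s \gamma_m} = n \sum_k r_k \cdot E(D_{\theta_s} \log p_k^*(X_k) \cdot D_{\gamma_m^*} \log p_k^*(X_k))$

    $\quad = n_m \cdot E(D_{\theta_s} \log p_m^*(X_m)) - n \sum_k r_k \cdot E(p_m^*(X_k) \cdot D_{\theta_s} \log p_k^*(X_k))$

    $\quad = n_m \cdot E(D_{\theta_s} \log p_m^*(X_m))$  [cf. (A.3) for $H(x) = p_m^*(x)$].

Hence $I_{\theta_s \gamma_m} = n_m \cdot b_{sm}$, $I_{\gamma_l \gamma_m} = n_m \cdot c_{lm}$, or in matrix notation

(A.8)  $I_{\theta\gamma} = B \cdot N$,  $I_{\gamma\gamma} = C \cdot N$.

From (5.6) we have $\sum_k n_k \cdot b_{sk} = 0$ and $\sum_k n_k \cdot c_{mk} = 0$, or in matrix notation

(A.9)  $0 = n_0 b_0 + B\mathbf{n}$,  $0 = n_0 c_0 + C\mathbf{n}$,  $\mathbf{n} = (n_1, \ldots, n_K)$.

Using the constant vector $e_+ = (1)$ and constant matrix $e_+ e_+^T = (1)$ we thus
obtain

    $I_{\theta\gamma} \cdot W = B \cdot N [n_0^{-1} e_+ e_+^T + N^{-1}] = n_0^{-1} B \cdot \mathbf{n} \cdot e_+^T + B = -b_0 \cdot e_+^T + B$

and similarly with $C$ instead of $B$

    $I_{\gamma\gamma} \cdot W = C \cdot N \cdot W = -c_0 \cdot e_+^T + C$.

Now (a)$_{\theta\gamma}$ is obtained as follows:

    $I_{\theta\gamma} \cdot W \cdot I_{\gamma\gamma} = [B - b_0 \cdot e_+^T][C \cdot N]^T = B \cdot N \cdot C^T - b_0 \cdot [C \cdot \mathbf{n}]^T$  [cf. (A.8)]

    $\quad = B \cdot N \cdot C^T + b_0 \cdot n_0 \cdot c_0^T = \tilde{B} \cdot \tilde{N} \cdot \tilde{C}^T$  [cf. (A.9)]

    $\quad = I_{\theta\gamma} - \Sigma_{\theta\gamma}$  [cf. (5.10)].

And (a)$_{\theta\theta}$, respectively (a)$_{\gamma\gamma}$, is established similarly (replace $B$ and $b_0$ by
$C$ and $c_0$, respectively, vice versa). Hence (a) holds, and multiplication with
$I^{-1}(\lambda)$ yields (b). □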

Proof of Condition (R2)$_{LBA}$ $\Rightarrow$ (R2''). Suppose for $s = (s_1, \ldots, s_K) \in \mathbb{R}^{K_X \times K}$
and $c_1, \ldots, c_K \in \mathbb{R}$ we have for all $k = 1, \ldots, K$

    $c_k = D_\theta \psi_\theta(X, y_k) \cdot s = \sum_l D_{\theta_l} \psi_\theta(X, y_k) \cdot s_l = h_X(X)^T \cdot s_k$  almost surely.

Then Condition (R2)$_{LBA}$ implies $s_k = 0$ for all $k$, and hence $s = 0$. □

A.3. Proof of Theorem 4 (consistency). The proof is based on the ingenious
ideas from Wald [15]. The log-odds ratio $\psi_\theta(x, y)$ in the model (3.13)
depends only on the vectors $z = h_X(x)$ and $v = h_Y(y)$. Therefore we regard
$p_k^*(x) = p_k(z \mid \lambda)$ as a function of $z$ and $\lambda$ using the notation

    $G_k(z, \theta) := G(z, h_Y(y_k), \theta) = \psi_\theta(x, y_k)$,

    $p_k(z \mid \lambda) = \dfrac{\exp[\gamma_k^* + G_k(z, \theta)]}{\sum_l \exp[\gamma_l^* + G_l(z, \theta)]}$,

    $\eta_k(z \mid \lambda) := \log p_k(z \mid \lambda) = \gamma_k^* + G_k(z, \theta) - \log\Big(\sum_l \exp[\gamma_l^* + G_l(z, \theta)]\Big)$.

We first show for $Z_k := h_X(X_k)$

(A.10)  $E\{|\eta_k(Z_k \mid \lambda)|\} < \infty$  for all $\lambda$ and $k = 0, \ldots, K$.

From $\gamma_0^* = 0 = G_0(z, \theta)$ and $p_0(z \mid \lambda) \le 1$ we get

    $|\eta_0(z \mid \lambda)| = \log\Big(\sum_l \exp[\gamma_l^* + G_l(z, \theta)]\Big) \le \log(K + 1) + \|\gamma^*\| + \mathrm{Max}_l\, |G_l(z, \theta)|$.

And Condition (OR2) yields

(A.11)  $|G_l(z, \theta)| \le [\psi_X(x) + \psi_Y(y_l)] \cdot \|\theta\|$,

which in view of Condition (C2) proves (A.10) for $k = 0$. For $k > 0$ we get

    $|\eta_k(z \mid \lambda)| = |\gamma_k^* + G_k(z, \theta) + \eta_0(z \mid \lambda)| \le \|\gamma^*\| + |G_k(z, \theta)| + |\eta_0(z \mid \lambda)|$.

Hence (A.11) and Condition (C2) establish (A.10).
Next we prove three basic lemmas. 

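Lemma A.1. For any $\lambda \ne \lambda^\circ$: $\sum_{k=0}^{K} r_k \cdot E\{\eta_k(Z_k \mid \lambda) - \eta_k(Z_k \mid \lambda^\circ)\} < 0$.

Lemma A.2. For $k = 0, \ldots, K$ and any $\lambda$:

    $\lim_{\varepsilon \to 0} E\Big\{\sup_{\|\lambda' - \lambda\| \le \varepsilon} \eta_k(Z_k \mid \lambda')\Big\} = E\{\eta_k(Z_k \mid \lambda)\}$.

Lemma A.3. For any compact set $A \subset \mathbb{R}^K \times \mathbb{R}^S$ with $\lambda^\circ \notin A$:

    $\lim_{n \to \infty} \Big[\sup_{\lambda \in A} \ell^{(n)}(\lambda) - \ell^{(n)}(\lambda^\circ)\Big] = -\infty$  almost surely.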

Proof of Lemma A.1. $U_k = \eta_k(Z_k \mid \lambda) - \eta_k(Z_k \mid \lambda^\circ)$ has finite expectation
by (A.10), and Jensen's inequality yields

(A.12)  $\sum_k r_k \cdot E(U_k) \le \sum_k r_k \cdot \log E\{\exp(U_k)\} \le \log\Big(\sum_k r_k \cdot E\{\exp(U_k)\}\Big)$.

Equation (A.2) with $G_k(X_k) = \exp(U_k) = p_k(Z_k \mid \lambda)[p_k(Z_k \mid \lambda^\circ)]^{-1}$ and $\lambda = \lambda^\circ$
gives (the true parameter is denoted by $\lambda^\circ$ here)

    $\sum_k r_k \cdot E\{\exp(U_k)\} = E^*\Big\{\sum_k p_k(Z \mid \lambda)\Big\} = 1$  with $Z = h_X(X)$,

and (A.12) implies $\sum_k r_k \cdot E\{U_k\} \le 0$. It remains to show that this inequality
is strict. Suppose not; then equality holds in both places of (A.12). The first
equality implies that each $U_k$ is constant almost surely, say $U_k = \log c_k$, and
the second yields $c_k = c$ for all $k$, hence $U_k = \log c$, respectively, $p_k(Z_k \mid \lambda) = c \cdot p_k(Z_k \mid \lambda^\circ)$
almost surely. From $\sum_k p_k = 1$ we get $c = 1$, and hence

(A.13)  $\eta_k(Z_k \mid \lambda) = \eta_k(Z_k \mid \lambda^\circ)$  for all $k$ almost surely.

Then

    $\psi_\theta(X_k, y_k) = \eta_k(Z_k \mid \lambda) + \eta_0(z^\circ \mid \lambda) - \eta_0(Z_k \mid \lambda) - \eta_k(z^\circ \mid \lambda) = \psi_{\theta^\circ}(X_k, y_k)$

almost surely, and since the distributions of $X_k$ and $X$ dominate each other,
$\psi_\theta(X, y_k) = \psi_{\theta^\circ}(X, y_k)$ for all $k$ almost surely.
From Condition (C3) we get $\theta = \theta^\circ$. For $\lambda = (\theta, \gamma^*)$ (A.13) gives almost
surely

    $\gamma_k^* + G_k(Z_k, \theta) = \eta_k(Z_k \mid \lambda) - \eta_0(Z_k \mid \lambda) = \gamma_k^\circ + G_k(Z_k, \theta^\circ)$  for all $k$

and from $\theta = \theta^\circ$ we conclude $\gamma^* = \gamma^\circ$, which contradicts $\lambda \ne \lambda^\circ$. □

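Proof of Lemma A.2. Continuity implies for any positive sequence
$\varepsilon_n \to 0$

    $\sup_{\|\lambda' - \lambda\| \le \varepsilon_n} \eta_k(z \mid \lambda') \xrightarrow[n \to \infty]{} \eta_k(z \mid \lambda)$.

Since

(A.14)  $\eta_k(z \mid \lambda) \le \sup_{\|\lambda' - \lambda\| \le \varepsilon_n} \eta_k(z \mid \lambda') \le 0$,

the dominated convergence theorem and (A.10) yield

    $E\Big\{\sup_{\|\lambda' - \lambda\| \le \varepsilon_n} \eta_k(Z_k \mid \lambda')\Big\} \xrightarrow[n \to \infty]{} E\{\eta_k(Z_k \mid \lambda)\}$.  □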



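Proof of Lemma A.3. For $\varepsilon > 0$ consider the ball $B(\lambda \mid \varepsilon) = \{\lambda' \mid \|\lambda' - \lambda\| \le \varepsilon\}$
with interior $B^\circ(\lambda \mid \varepsilon)$ and let $\eta_k(z \mid \lambda, \varepsilon) = \sup_{\lambda' \in B(\lambda \mid \varepsilon)} \eta_k(z \mid \lambda')$.
Lemma A.2 implies

    $\lim_{\varepsilon \to 0} \sum_k r_k \cdot E\{\eta_k(Z_k \mid \lambda, \varepsilon)\} = \sum_k r_k \cdot E\{\eta_k(Z_k \mid \lambda)\}$

and for any $\lambda \in A$ Lemma A.1 gives

    $\sum_k r_k \cdot E\{\eta_k(Z_k \mid \lambda)\} < \sum_k r_k \cdot E\{\eta_k(Z_k \mid \lambda^\circ)\}$.

Hence there exists an $\varepsilon_\lambda > 0$ such that

(A.15)  $\sum_k r_k \cdot E\{\eta_k(Z_k \mid \lambda, \varepsilon_\lambda)\} < \sum_k r_k \cdot E\{\eta_k(Z_k \mid \lambda^\circ)\}$.

Since $A$ is compact, there are finitely many $\lambda_1, \ldots, \lambda_M \in A$ such that for
any $\lambda \in A$ there exists $1 \le m \le M$ with $\lambda \in B^\circ(\lambda_m \mid \varepsilon_{\lambda_m})$. Thus
$\eta_k(z \mid \lambda) \le \eta_k(z \mid \lambda_m, \varepsilon_{\lambda_m})$ and

(A.16)  $\sup_{\lambda \in A} \ell^{(n)}(\lambda) - \ell^{(n)}(\lambda^\circ) \le \mathrm{Max}_m \sum_k \sum_i [\eta_k(Z_{ki} \mid \lambda_m, \varepsilon_{\lambda_m}) - \eta_k(Z_{ki} \mid \lambda^\circ)]$.

For each $m$ the strong law of large numbers gives almost surely

    $\lim_{n \to \infty} \frac{1}{n} \sum_k \sum_i [\eta_k(Z_{ki} \mid \lambda_m, \varepsilon_{\lambda_m}) - \eta_k(Z_{ki} \mid \lambda^\circ)] = \sum_k r_k \cdot [E\{\eta_k(Z_k \mid \lambda_m, \varepsilon_{\lambda_m})\} - E\{\eta_k(Z_k \mid \lambda^\circ)\}] < 0$  [cf. (A.15)]

with finite expectations by (A.10) and (A.14). Hence

    $\lim_{n \to \infty} \sum_k \sum_i [\eta_k(Z_{ki} \mid \lambda_m, \varepsilon_{\lambda_m}) - \eta_k(Z_{ki} \mid \lambda^\circ)] = -\infty$

and the right-hand side in (A.16) tends to $-\infty$ for $n \to \infty$ almost surely. □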

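Proof of Theorem 4 (consistency). For any $\varepsilon > 0$, the function
$\ell^{(n)}(\lambda)$ attains its maximum within $B(\lambda^\circ \mid \varepsilon)$. We show first that (almost
surely) the maximizing argument lies (for almost all $n$) in the open ball
$B^\circ(\lambda^\circ \mid \varepsilon)$, and hence is a solution of $D_\lambda \ell^{(n)}(\lambda) = 0$. Applying Lemma A.3
to the boundary $A_\varepsilon = \partial B(\lambda^\circ \mid \varepsilon)$ yields that the following statements hold
almost surely for almost all $n$:

(i) $\sup_{\lambda \in A_\varepsilon} \ell^{(n)}(\lambda) < \ell^{(n)}(\lambda^\circ)$,

(ii) $\sup_{\|\lambda - \lambda^\circ\| = \varepsilon} \ell^{(n)}(\lambda) < \sup_{\|\lambda - \lambda^\circ\| < \varepsilon} \ell^{(n)}(\lambda)$,

(iii) there exists $\tilde\lambda^{(n)} \in B^\circ(\lambda^\circ \mid \varepsilon)$ with $D_\lambda \ell^{(n)}(\tilde\lambda^{(n)}) = 0$,

(iv) $\ell^{(n)}(\lambda)$ is strictly concave [cf. (6.1), (R2')],

(v) there is a unique $\tilde\lambda^{(n)} \in B^\circ(\lambda^\circ \mid \varepsilon)$ maximizing $\ell^{(n)}(\lambda)$,

(vi) $\tilde\lambda^{(n)} = \hat\lambda^{(n)}$ [cf. (C1)].

This proves (a), (b) and also (c), since $\varepsilon$ was arbitrary. □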
A.4. Proof of the results in Section 7. 

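Proof of Theorem 5 (normality). The (standard) proof is only
outlined. For $U^{(n)} = D_\lambda \ell^{(n)}$ the central limit theorem and (5.6) give

(A.17)  $n^{-1/2}\, U^{(n)}(\lambda^\circ) \xrightarrow[n \to \infty]{\mathcal{L}} N(0, \bar\Sigma(\lambda^\circ))$.

A first-order expansion about $\lambda^\circ$ yields

    $n^{-1/2}\, U^{(n)}(\hat\lambda^{(n)}) = n^{-1/2}\, U^{(n)}(\lambda^\circ) + D_n \cdot \sqrt{n}\,[\hat\lambda^{(n)} - \lambda^\circ]$

with

    $D_n := \frac{1}{n} \int_0^1 D_\lambda U^{(n)}(\lambda^\circ + t[\hat\lambda^{(n)} - \lambda^\circ])\, dt = -\frac{1}{n} \int_0^1 J^{(n)}(\lambda^\circ + t[\hat\lambda^{(n)} - \lambda^\circ])\, dt$

and Condition (N1) implies

    $D_n \cdot \sqrt{n}\,[\hat\lambda^{(n)} - \lambda^\circ] + n^{-1/2}\, U^{(n)}(\lambda^\circ) \xrightarrow[n \to \infty]{P} 0$.

$D_n$ can be replaced by its limit $-\bar{I}(\lambda^\circ)$ from Condition (N3), that is,

    $\sqrt{n}\,[\hat\lambda^{(n)} - \lambda^\circ] - n^{-1/2}\, \bar{I}^{-1}(\lambda^\circ)\, U^{(n)}(\lambda^\circ) \xrightarrow[n \to \infty]{P} 0$,

which together with (A.17) establishes (a). And (b) follows in view of
Theorem 3(b). □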

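Proof of Theorem 6. Keeping the notation from Appendix A.3, the
partial derivatives of

    $\eta_k(z \mid \lambda) = \log p_k(z \mid \lambda) = \gamma_k^* + G_k(z, \theta) - \log\Big(\sum_l \exp[\gamma_l^* + G_l(z, \theta)]\Big)$

up to order 3 are given by

    $D_{\gamma_m} \eta_k(z \mid \lambda) = \Delta_{km} - p_m(z \mid \lambda)$,

    $D^2_{\gamma_m \lambda_s} \eta_k(z \mid \lambda) = -D_{\lambda_s} p_m(z \mid \lambda)$,

    $D^3_{\gamma_m \lambda_s \lambda_t} \eta_k(z \mid \lambda) = -D^2_{\lambda_s \lambda_t} p_m(z \mid \lambda)$,

    $D_{\theta_r} \eta_k(z \mid \lambda) = \sum_l [\Delta_{kl} - p_l(z \mid \lambda)] \cdot D_{\theta_r} G_l(z, \theta)$,

    $D^2_{\theta_r \theta_s} \eta_k(z \mid \lambda) = \sum_l \big([\Delta_{kl} - p_l(z \mid \lambda)] \cdot D^2_{\theta_r \theta_s} G_l(z, \theta) - D_{\theta_s} p_l(z \mid \lambda) \cdot D_{\theta_r} G_l(z, \theta)\big)$,

    $D^3_{\theta_r \theta_s \theta_t} \eta_k(z \mid \lambda) = \sum_l \big([\Delta_{kl} - p_l(z \mid \lambda)] \cdot D^3_{\theta_r \theta_s \theta_t} G_l(z, \theta) - D_{\theta_t} p_l(z \mid \lambda) \cdot D^2_{\theta_r \theta_s} G_l(z, \theta)$
    $\qquad - D_{\theta_s} p_l(z \mid \lambda) \cdot D^2_{\theta_r \theta_t} G_l(z, \theta) - D^2_{\theta_s \theta_t} p_l(z \mid \lambda) \cdot D_{\theta_r} G_l(z, \theta)\big)$

with partial derivatives [cf. (A.1)]

    $D_{\lambda_s} p_k(z \mid \lambda) = p_k(z \mid \lambda) \cdot D_{\lambda_s} \eta_k(z \mid \lambda)$,

    $D^2_{\lambda_s \lambda_t} p_k(z \mid \lambda) = p_k(z \mid \lambda)\,[D^2_{\lambda_s \lambda_t} \eta_k(z \mid \lambda) + D_{\lambda_s} \eta_k(z \mid \lambda) \cdot D_{\lambda_t} \eta_k(z \mid \lambda)]$.

Next we deduce from Condition (MC) a weaker moment condition, from
which Conditions (N3) and (N4) will be derived (cf. Lemma A.4):

Condition (MC)$^-$. There exists $\varepsilon^\circ > 0$ such that for $B(\lambda^\circ) = \{\lambda \mid \|\lambda - \lambda^\circ\| < \varepsilon^\circ\}$
and all $k = 0, \ldots, K$ the following functions:

    $\sup_{\lambda \in B(\lambda^\circ)} |D^3_{\lambda_r \lambda_s \lambda_t} \eta_l(Z_k \mid \lambda)|$  with $Z_k = h_X(X_k)$

have finite expectation for all $r, s, t = 1, \ldots, S$ and $l = 0, \ldots, K$.

For the above derivatives we successively get the following bounds, where
the fixed argument $z$ is omitted:

    $|D_{\gamma_m} \eta_k(\lambda)| \le 1$,  $|D_{\theta_r} \eta_k(\lambda)| \le H_+(\theta)$,

    $|D_{\lambda_r} \eta_k(\lambda)| \le H_+^*(\theta) := 1 + H_+(\theta)$,

    $|D^2_{\gamma_m \lambda_s} \eta_k(\lambda)| = |D_{\lambda_s} p_m(\lambda)| \le |D_{\lambda_s} \eta_m(\lambda)| \le H_+^*(\theta)$,

    $|D^2_{\theta_r \theta_s} \eta_k(\lambda)| \le H_{++}(\theta) + H_+(\theta)^2$,

    $|D^2_{\lambda_s \lambda_t} \eta_k(\lambda)| \le H_+^*(\theta)^2 + H_{++}(\theta)$,

    $|D^2_{\lambda_s \lambda_t} p_k(\lambda)| \le 2 H_+^*(\theta)^2 + H_{++}(\theta)$,

    $|D^3_{\gamma_m \lambda_s \lambda_t} \eta_k(\lambda)| = |D^2_{\lambda_s \lambda_t} p_m(\lambda)| \le 2 H_+^*(\theta)^2 + H_{++}(\theta)$,

    $|D^3_{\theta_r \theta_s \theta_t} \eta_k(\lambda)| \le H_{+++}(\theta) + 3 H_+^*(\theta) H_{++}(\theta) + 2 H_+^*(\theta)^3$.

Taking (for fixed $z$) the supremum over the ball $B(\theta^\circ)$ gives

    $\sup H_+^* \le 1 + \sum_r \sup H_r$,

    $\sup (H_+^*)^2 \le 1 + 2 \sum_s \sup H_s + \sum_{s,t} \sup H_s H_t$,

    $\sup (H_+^*)^3 \le 1 + 3 \sum_r \sup H_r + 3 \sum_r \sum_s \sup H_r H_s + \sum_r \sum_s \sum_t \sup H_r H_s H_t$,

    $\sup H_{++} \le \sum_{r,s} \sup H_{rs}$,

    $\sup H_{+++} \le \sum_{r,s,t} \sup H_{rst}$,

    $\sup H_+^* \cdot H_{++} \le \sum_{r,s,t} [\sup H_{st} + \sup H_r H_{st}]$.

Condition (MC) obviously implies for $i = 1, 2$ that

    $\sup_{\theta \in B(\theta^\circ)} H_r(Z_k \mid \theta)^i$,  $\sup_{\theta \in B(\theta^\circ)} H_r(Z_k \mid \theta) \cdot H_{st}(Z_k \mid \theta)$

have finite expectation, too. Hence

    $\sup_\gamma \sup_{\theta \in B(\theta^\circ)} |D^3_{\lambda_r \lambda_s \lambda_t} \eta_l(Z_k \mid \theta, \gamma)|$

has finite expectation for any $r, s, t$ and any $l$. This proves Condition
(MC)$^-$ and Lemma A.4 establishes the theorem. □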

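Lemma A.4. Conditions (N2) and (MC)$^-$ imply Conditions (N3) and
(N4).

Proof. Using (6.1) for $\lambda = \lambda^\circ$ to establish Condition (N3), it suffices
to show for any $s$ and $t$ that

(A.18)  $\frac{1}{n} \int_0^1 [J_{st}^{(n)}(\lambda^\circ + t[\hat\lambda^{(n)} - \lambda^\circ]) - J_{st}^{(n)}(\lambda^\circ)]\, dt \xrightarrow[n \to \infty]{P} 0$.

From

    $J_{st}^{(n)}(\lambda) = -\sum_k \sum_i D^2_{\lambda_s \lambda_t} \eta_k(Z_{ki} \mid \lambda)$  with $Z_{ki} = h_X(X_{ki})$

a Taylor expansion gives for any $\varepsilon > 0$ and $\|\lambda - \lambda^\circ\| \le \varepsilon$

    $\frac{1}{n} |J_{st}^{(n)}(\lambda) - J_{st}^{(n)}(\lambda^\circ)| \le \varepsilon\, S^{(n)}(\varepsilon)$,

    $S^{(n)}(\varepsilon) = \frac{1}{n} \sum_k \sum_i \sup_{\|\lambda' - \lambda^\circ\| \le \varepsilon} \|D_\lambda D^2_{\lambda_s \lambda_t} \eta_k(Z_{ki} \mid \lambda')\|$.

The strong law of large numbers yields

    $S^{(n)}(\varepsilon) \xrightarrow[n \to \infty]{} \sum_k r_k\, E\Big\{\sup_{\|\lambda' - \lambda^\circ\| \le \varepsilon} \|D_\lambda D^2_{\lambda_s \lambda_t} \eta_k(Z_k \mid \lambda')\|\Big\}$  almost surely,

where the limit is finite by Condition (MC)$^-$ for $\varepsilon < \varepsilon^\circ$. For $\|\hat\lambda^{(n)} - \lambda^\circ\| \le \varepsilon$
we thus have

    $\Big|\frac{1}{n} \int_0^1 [J_{st}^{(n)}(\lambda^\circ + t[\hat\lambda^{(n)} - \lambda^\circ]) - J_{st}^{(n)}(\lambda^\circ)]\, dt\Big| \le \frac{1}{n} \sup_{\|\lambda - \lambda^\circ\| \le \varepsilon} |J_{st}^{(n)}(\lambda) - J_{st}^{(n)}(\lambda^\circ)| \le \varepsilon\, S^{(n)}(\varepsilon)$,

which in view of Condition (N2) implies (A.18). And Condition (N4) follows
similarly. Note that if almost sure convergence $\hat\lambda^{(n)} \to \lambda^\circ$ is assumed
instead of Condition (N2), then the above arguments establish almost sure
convergence in Conditions (N3) and (N4), too. □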